TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation

 André Sacilotti
Institute of Mathematics and Computer Science
University of São Paulo
[email protected]
& Samuel Felipe dos Santos
Dept. of Computing
Federal University of São Carlos
[email protected]
& Nicu Sebe
Dept. of Information Engineering and Computer Science
University of Trento
[email protected]
& Jurandy Almeida
Dept. of Computing
Federal University of São Carlos
[email protected]
Abstract

Unsupervised domain adaptation (UDA) in videos is a challenging task that remains underexplored compared to image-based UDA techniques. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has received little attention. Our key idea is to use the transformer layers as a feature encoder and incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge from different backbones. To improve the transferability of ViT, we introduce a novel and effective module named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by changing the self-attention mechanism to a transferability attention mechanism. Extensive experiments on UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC Drone datasets with different backbones, like ResNet101, I3D, and STAM, verify the effectiveness of TransferAttn compared with state-of-the-art approaches. Also, we demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both video and image domains. The code will be made freely available.

Keywords Action Recognition · Unsupervised Domain Adaptation · Adversarial Domain Adaptation

1 Introduction

With the popularization of social media platforms focused on user-generated content, a huge volume of data is generated; for instance, 720,000 hours of video content are uploaded to YouTube daily (https://www.demandsage.com/youtube-stats/, as of February 29, 2024). Cataloging and searching this content is necessary, but manually analyzing such an immense amount of content is practically impossible, which makes video analysis tasks crucial.

Among the several video analysis tasks, action recognition is one of the most popular and challenging, since the way an action is carried out and captured can vary significantly, for example, in speed, duration, camera and actor movement, and occlusion da Costa et al. (2022).

Various deep learning methods for action recognition are available in the literature. These approaches can be classified based on how they handle the temporal dimension. Some use 3D models to capture spatial and temporal features, while others treat spatial and temporal data separately or employ Recurrent Neural Networks (RNNs) to model the temporal dynamics Kong and Fu (2022). Despite all the advances, the temporal structure of videos still poses some challenges for training deep learning models Huang et al. (2018). Human costs are high, as many video annotations are needed to yield good results. Obtaining and annotating a desirable amount of data is difficult for many application domains, requiring significant human effort and specific knowledge Wang et al. (2017).

Unsupervised Domain Adaptation (UDA) can be used to reduce the cost of manually annotating data. In these strategies, the model is trained with labeled data from a source domain and unlabeled data from a target domain to perform well on the target domain’s test set. Because the source and target domains may differ in background, illumination, camera position, and so on, UDA methods must deal with the distribution mismatch caused by this domain gap Chen et al. (2022). Several works have been proposed in the literature to address this issue, e.g., adversarial-based methods Ganin et al. (2017); Tzeng et al. (2017); Ganin and Lempitsky (2015), metric-based methods Ghifary et al. (2014); Long et al. (2017), and more recently, transformer-based methods Xu et al. (2021); Yang et al. (2023), achieving remarkable results. However, these methods target image UDA; video UDA is considerably less explored and more challenging, as it requires handling the temporal aspects of the data da Costa et al. (2022).

Only a few recent works Munro and Damen (2020); Chen et al. (2019); Yin et al. (2022); Wei et al. (2023); Turrisi da Costa et al. (2022); Dasgupta et al. (2023); Chen et al. (2022); Huang et al. (2022); da Costa et al. (2022); Li et al. (2023) tackle video UDA for action recognition using deep learning with strategies like contrastive learning, cross-domain attention mechanisms, self-supervised learning, and multiple data modalities. Even fewer works da Costa et al. (2022); Huang et al. (2022) explore transformer architectures.

In this work, we propose a novel method for video UDA in action recognition, Transferable-guided Attention (TransferAttn), which shows the potential of transformer architecture. Our method uses pre-trained frozen backbones to extract frame-by-frame features of the videos. A transformer encoder is used to reduce the domain gap and learn temporal relationships between frames. The encoder also includes our proposed transformer block, named Domain Transferable-guided Attention Block (DTAB), which introduces a new attention mechanism. Finally, we use two classification heads, one for classification and one for domain adaptation, that employ adversarial learning.

We evaluate our approach on three well-known video UDA benchmarks for action recognition, UCF $\leftrightarrow$ HMDB$_{full}$ Chen et al. (2019), Kinetics $\rightarrow$ Gameplay Chen et al. (2019), and Kinetics $\rightarrow$ NEC-Drone Choi et al. (2020), where we outperform other state-of-the-art methods. We also integrate our proposed DTAB module into other state-of-the-art transformer architectures for UDA and show that it increases their performance.

The main contributions of this paper are summarized as follows:

  • To the best of our knowledge, we are the first to present a backbone-independent transformer architecture on video UDA. Our empirical experiments showed the effectiveness of the transformer encoder in extracting fine-grained spatio-temporal transferable representations.

  • We propose DTAB, a novel transferable transformer block for UDA. Our method employs a new attention mechanism that improves adaptation and domain transferability. Also, we show the positive effect of applying the DTAB module to other state-of-the-art UDA methods for videos and images.

  • We conduct extensive experiments on several benchmarks, setting a new state-of-the-art result on three different cross-domain datasets. Also, our ablation study demonstrates the positive effect of each part of our method.

2 Related Work

2.1 Video-based Action Recognition.

Action recognition methods have been extensively studied with the advent of deep learning, especially with the introduction of large-scale video datasets, such as Kinetics, Moments-In-Time, YouTube Sports 1M, and YouTube 8M Ji et al. (2013). BEAR Deng et al. (2023) introduces a new action recognition benchmark designed to cover a diverse set of real-world applications. Deep learning CNN models can be divided into three categories according to how they model the temporal dimension Kong and Fu (2022): (1) space-time networks, (2) multi-stream networks, and (3) hybrid models. Space-time networks use 3D convolutions to maintain temporal information, inflating 2D kernels to 3D, like C3D Tran et al. (2015) and I3D Carreira and Zisserman (2017). Multi-stream networks employ different models to deal with spatial (usually RGB images) and motion information (usually optical flow), like TSN Wang et al. (2018), which applies temporal sampling, and TDN Wang et al. (2021), which has modules to capture short-term and long-term (across segments) motion. Hybrid models integrate recurrent networks, like LSTMs Donahue et al. (2015); Yue-Hei Ng et al. (2015); Wu et al. (2015), and Temporal CNNs Ke et al. (2017), on top of the CNNs. Skeleton data, like body joint information, can also be utilized Shahroudy et al. (2016); Zhu et al. (2016); Liu et al. (2016); Ke et al. (2017), and recent works Yan et al. (2018); Si et al. (2019) show that graph convolution outperforms RNNs and Temporal CNNs at capturing information from joints Kong and Fu (2022). Kim et al. (2024) present a novel training approach to make models robust to distribution shifts.

2.2 Unsupervised Domain Adaptation.

Unsupervised domain adaptation (UDA) in the image domain has a wide range of strategies to address the domain shift. A standard option is adversarial-based methods Ganin et al. (2017); Tzeng et al. (2017); Ganin and Lempitsky (2015); Lai et al. (2024), which use a domain discriminator and a feature extractor trained against each other in a min-max optimization game, similar to Generative Adversarial Network (GAN) Goodfellow et al. (2014) training, to minimize the domain gap. In addition, metric-based methods aim to reduce the domain gap by learning domain-invariant features through discrepancy metrics, like Maximum Mean Discrepancy (MMD) Ghifary et al. (2014) and Joint Adaptation Networks (JAN) Long et al. (2017), which incorporate a loss that measures the discrepancy between the domain features and minimize it to reduce the domain shift. Driven by the success of Vision Transformers, CDTrans Xu et al. (2021) adopts a three-branch cross transformer that proves to be robust to noise. In contrast, TVT Yang et al. (2023) uses a transferability metric to weight the attention of the class token. Although TVT shows great results by injecting transferability into the class token weight, it lacks two essential points: i) it does not use spatial relation transferability; ii) as an image UDA method, it does not incorporate temporal relation transferability.

2.3 Unsupervised Domain Adaptation for Action Recognition.

Although there are several possible applications of UDA for action recognition in real-world problems, only a limited number of recent studies have tackled this challenging task Munro and Damen (2020); Chen et al. (2019); Yin et al. (2022); Wei et al. (2023); Turrisi da Costa et al. (2022); Dasgupta et al. (2023); Chen et al. (2022); Li et al. (2023). TA3 Chen et al. (2019) proposes a domain attention mechanism that focuses on the temporal dynamics of the videos. MA2LT-D Chen et al. (2022) generates multi-level temporal features with multiple domain discriminators. Level-wise attention weights are calculated by domain confusion and features are aggregated by attention determined by the domain discriminators. Other approaches use multiple modalities of data, like MM-SADA Munro and Damen (2020), where self-supervision among modalities is used, and MixDANN Yin et al. (2022), which dynamically estimates the most adaptable modality and uses it as a teacher to the others.

CleanAdapt Dasgupta et al. (2023) tackles the source-free video domain adaptation problem using a model pre-trained on the source domain to generate noisy labels for the target domain; the labels most likely to be correct are then used to fine-tune the model. STHC Li et al. (2023) also tackles the source-free setting, using spatial and temporal augmentation. In a different approach, TranSVAE Wei et al. (2023) handles spatial and temporal domain divergence separately by constraining different sets of latent factors.

Although transformers can obtain state-of-the-art performance, only a few works apply them to video UDA. UDAVT da Costa et al. (2022) is a recent work that leverages the STAM visual transformer Sharir et al. (2021) and proposes a domain alignment loss based on the Information Bottleneck (IB) principle to learn domain-invariant features. MTRAN Huang et al. (2022), which depends on 3D backbones, uses a transformer layer inspired by ViViT Arnab et al. (2021), where each token is a 16-frame clip representation. Although UDAVT and MTRAN show great results by incorporating the transformer mechanism, the UDAVT architecture strictly depends on transformer backbones that deal separately with spatial and temporal relations, like STAM Sharir et al. (2021). MTRAN, in turn, depends on 3D backbones and computes attention over clip-level pooled features, lacking a more fine-grained frame-level relation. Moreover, neither of them explores ways to improve knowledge transfer in the transformer mechanism.

3 Our Approach

Figure 1 shows a simplified overview of our method. In Section 3.1, we first discuss the preliminaries and background on adversarial unsupervised domain adaptation and transformers, and in Section 3.2, we then detail our domain transferable-guided attention block, called DTAB, and its components.

3.1 Preliminaries and Background

Figure 1: TransferAttn overview. The input video frames are fed into a fixed Backbone to extract frame-by-frame features, followed by a Clip Embedding to map frames into tokens. The embeddings are fed into a sequence of transformers to extract relevant transferable spatiotemporal information. The adaptation branch for adversarial domain discrimination uses fine-grained representations from the transformer encoder.

3.1.1 Network Overview.

The overall architecture consists of five components: the backbone, patch embedding, encoder, classification head, and adversarial head, as shown in Figure 1. Given $n_s$ labeled videos in the source domain and $n_t$ unlabeled videos in the target domain, with $k$ sampled frames each, we define the $j$-th frame from the $i$-th video as $x^s_{i,j}$ for the source domain and $x^t_{i,j}$ for the target domain.

The backbone in our method ($\mathcal{G}_b$) is fixed and not trained. The patch embedding ($\mathcal{G}_p$) is an MLP that maps the features from the backbone to the transformer encoder input size. The transformer encoder ($\mathcal{G}_e$) comprises $L-1$ transformer layers and one DTAB module, with $h$ attention heads and a hidden size of $d$.

In the adversarial head, the Gradient Reversal Layer (GRL), $\mathcal{G}_{grl}$, inverts the gradients with weight $\lambda$, resulting in a min-max optimization; the discriminator, $\mathcal{G}_D$, then tries to determine whether the video originates from the source or the target domain. The classification head contains a classifier, $\mathcal{G}_C$. Unlike the discriminator, the classifier MLP is not trained. This makes it more robust to noisy labels from the source domain, avoids a projection that can overfit the source domain features, and makes learning to discriminate action classes the encoder’s responsibility.

For convenience, we refer to the extracted features for the $i$-th video and the $j$-th frame from the source domain as $F^s_{i,j}$, and $F^t_{i,j}$ for the target domain.

3.1.2 Transformer Encoder.

This work aims to train our Transformer Encoder to align the data distributions, reduce the domain gap, and improve the temporal relation information between the frames. Our method does not rely on the CLS token of the transformer encoder. Instead, we use Global Average Pooling (GAP) over the patches. For convenience, we define the features from the Transformer Encoder as $f_i^s$ and $f_i^t$ for the source and target domains, respectively, as shown in Equations 1 and 2.

$$f_i^s = GAP(\mathcal{G}_e(\mathcal{G}_p(F_{i,1}^s, F_{i,2}^s, \ldots, F_{i,k}^s))) \tag{1}$$
$$f_i^t = GAP(\mathcal{G}_e(\mathcal{G}_p(F_{i,1}^t, F_{i,2}^t, \ldots, F_{i,k}^t))) \tag{2}$$
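As a rough illustration of Equations 1 and 2, the sketch below maps per-frame backbone features to tokens, passes them through an encoder, and averages over the frame axis. It is only a minimal sketch: the function name `embed_and_pool`, the single linear map standing in for the embedding MLP, the identity encoder, and the toy sizes (16 frames, 2048-dimensional features, hidden size 512) are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def embed_and_pool(frame_features, w_embed, encoder=lambda tokens: tokens):
    """Map k frame features to tokens, encode them, and average over frames.

    frame_features: (k, backbone_dim) array, one row per sampled frame.
    w_embed: (backbone_dim, d) matrix standing in for the embedding MLP G_p.
    encoder: placeholder for the transformer encoder G_e (identity here).
    """
    tokens = frame_features @ w_embed   # G_p: (k, d)
    encoded = encoder(tokens)           # G_e: (k, d)
    return encoded.mean(axis=0)         # GAP over the k frame tokens: (d,)

# Toy sizes, chosen only for the example.
rng = np.random.default_rng(0)
k, backbone_dim, d = 16, 2048, 512
F_i = rng.normal(size=(k, backbone_dim))           # features F_{i,1..k} of one video
W_p = rng.normal(size=(backbone_dim, d)) * 0.01    # hypothetical embedding weights
f_i = embed_and_pool(F_i, W_p)                     # pooled video-level feature
```

The pooled vector has one entry per hidden dimension regardless of the number of sampled frames, which is what lets the same heads consume videos from either domain.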

3.1.3 Classification Head.

This branch of the network is a classifier $\mathcal{G}_C$, which trains the transformer encoder $\mathcal{G}_e$ to minimize the cross-entropy $\mathcal{L}_{cls}$ on the source domain data and the soft entropy $\mathcal{L}_H$ on the target data. Equations 3 and 4 define the cross-entropy loss and the soft-entropy loss, respectively.

$$\mathcal{L}_{cls} = -\frac{1}{n_s}\sum^{n_s}_{i=1} y_i \cdot \log \mathcal{G}_C(f_i^s) \tag{3}$$
$$\mathcal{L}_H = -\frac{1}{n_t}\sum^{n_t}_{i=1} \mathcal{G}_C(f_i^t) \cdot \log \mathcal{G}_C(f_i^t) \tag{4}$$
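The two losses in Equations 3 and 4 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: it assumes the classifier outputs are already softmax probabilities, that $y_i$ is a one-hot label given as a class index, and it adds a small `eps` for numerical safety. The function names are hypothetical.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Eq. 3 with one-hot y_i: mean negative log-probability of the true class.

    probs: (n_s, num_classes) classifier probabilities for source videos.
    labels: (n_s,) integer class indices.
    """
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + eps))

def soft_entropy(probs, eps=1e-12):
    """Eq. 4: mean Shannon entropy of the predictions on target videos."""
    return -np.mean(np.sum(probs * np.log(probs + eps), axis=-1))

# Uniform predictions are maximally uncertain; one-hot predictions have zero entropy.
uniform = np.full((3, 4), 0.25)
confident = np.eye(4)[[0, 2, 3]]
h_uniform = soft_entropy(uniform)        # equals log(4) up to eps
h_confident = soft_entropy(confident)    # approximately 0
ce_confident = cross_entropy(confident, np.array([0, 2, 3]))  # approximately 0
```

Minimizing the soft entropy on unlabeled target videos pushes the predictions toward confident class assignments, while the cross-entropy term uses the source labels.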

3.1.4 Adaptation Head.

This branch of the network is a simple MLP with the Gradient Reversal Layer (GRL). It trains the discriminator $\mathcal{G}_D$ to identify whether the video is from the source or target domain and, at the same time, trains the encoder $\mathcal{G}_e$ to confuse the discriminator $\mathcal{G}_D$, resulting in a min-max game. The adversarial loss $\mathcal{L}_{adv}$ is defined in Equation 5, where, for convenience, $\mathcal{L}_b$ denotes the binary cross-entropy loss and $d$ denotes the domain label.

$$\mathcal{L}_{adv} = -\frac{1}{n}\sum_{i}^{n} \mathcal{L}_b(\mathcal{G}_D(\mathcal{G}_{grl}(f_i)), d) \tag{5}$$
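The gradient reversal behind the adaptation head can be illustrated without an autograd framework. The sketch below is a hand-written toy under several assumptions: the class `GradientReversal`, the linear-plus-sigmoid discriminator, the label convention (1 for source, 0 for target), and the toy sizes are hypothetical, and the code computes the standard non-negative binary cross-entropy, leaving the sign convention of Equation 5 aside.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam going backward."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

def binary_cross_entropy(p, d, eps=1e-12):
    """Mean binary cross-entropy between probabilities p and domain labels d."""
    return -np.mean(d * np.log(p + eps) + (1 - d) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 8))           # pooled video features f_i (toy sizes)
d = np.array([1.0, 1.0, 0.0, 0.0])    # assumed convention: 1 = source, 0 = target
w = rng.normal(size=8)                # toy linear discriminator weights

grl = GradientReversal(lam=0.5)
p = 1.0 / (1.0 + np.exp(-(grl.forward(f) @ w)))   # discriminator probabilities
loss = binary_cross_entropy(p, d)

# Gradient of the mean BCE w.r.t. the discriminator input, computed by hand.
grad_wrt_grl_output = ((p - d) / len(d))[:, None] * w[None, :]
# The encoder receives the reversed, scaled gradient.
grad_to_encoder = grl.backward(grad_wrt_grl_output)
```

Because the encoder sees the negated gradient, a step that lowers the discriminator's loss for the discriminator raises it for the encoder, which is the min-max game described above.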

3.2 DTAB: Domain Transferable-guided Attention Block

In this section, we describe our Domain Transferable-guided Attention Block (DTAB), which uses transferable attention to calculate a weight representing the transferability of each patch, considering the spatio-temporal relation dynamics from the video data.

3.2.1 MDTA: Multi-head Domain Transferable-guided Attention.

Before presenting our proposed method, we review the self-attention mechanism Vaswani et al. (2017), which captures long-range dependencies. The mechanism computes these dependencies through the dot products between a set of query vectors ($\mathbf{Q}$) and a set of key vectors ($\mathbf{K}$), and uses the result to weight the value vectors ($\mathbf{V}$), as shown in Equation 6.

$$SA(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d}}\right)V \tag{6}$$
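Equation 6 translates almost directly into NumPy. The following is a minimal single-head sketch with no learned projections; the helper names and the toy sizes (4 tokens, dimension 8) are chosen only for the example.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # (n_tokens, n_tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 frame tokens of dimension 8
out, weights = self_attention(X, X, X)        # self-attention: Q = K = V = X
```

Each row of `weights` is a probability distribution over the tokens, so every output token is a convex combination of the value vectors.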

Figure 2 illustrates our proposed attention mechanism. It uses a domain discriminator that classifies every patch as belonging to the source or target domain, and the error of this discrimination forms a weight that measures the transferability of each patch. The dot product between the discrimination errors of pairs of patches yields a transferability metric over frame-to-frame relations, which brings in temporal information. In addition, integrating the method into the multi-head mechanism accounts for the temporal relations between different spatial representations of the frames.

Figure 2: MDTA overview. The Multi-head Domain Transferable-guided Attention computes the dot product between the query and key discrimination errors, resulting in a transferability matrix that captures spatiotemporal relations.

In Equation 7, we define the Domain Transferable-guided Attention (DTA), which computes the dot product between the discrimination errors of the $Q$ and $K$ vectors. The result is a transferability matrix that indicates which patches or frames are more or less transferable than others, considering the long-term temporal relation between the frames. For convenience, we define $W^Q_i$, $W^K_i$, and $W^V_i$ as the projections of the different heads, $W^O$ as a projection of the concatenation, and $d_h = \frac{d_{model}}{h}$. In other words, if the DDE of a patch goes to one, the patch is more likely to confuse the discriminator because its features are not easy to discriminate, and it should carry more weight when classifying the video action. Also, in Equation 8, we define the Multi-head Domain Transferable-guided Attention (MDTA), which incorporates the information from different token subspace representations. The purpose of the Gradient Reversal Layer (GRL) is to prevent the discriminator from overfitting due to back-propagation from the classification head.

$$\begin{split} DTA_i(Q,K,V) = \\ softmax\left(\frac{DDE(\mathcal{G}_{D'}(QW^Q_i)) \cdot DDE(\mathcal{G}_{D'}(KW^K_i))^T}{\sqrt{d_h}}\right)VW^V_i \end{split} \tag{7}$$
$$MDTA(Q,K,V) = concat(\{DTA_i\}_{i=1}^{h})W^O \tag{8}$$

For convenience, we define the domain discriminator as $\mathcal{G}_{D'}$. Also, we define the Domain Discriminator Error, $DDE$, as a binary cross entropy, as shown in Equation 9.

$$DDE(x) = \begin{cases} \log x, & \text{if } x \text{ is from source} \\ \log(1-x), & \text{otherwise} \end{cases} \tag{9}$$

In short, integrating the multi-head mechanism with the new attention mechanism brings spatiotemporal information to give more weight to frames or patches that are likely to confuse the domain discriminator.
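One way to read Equations 7 and 9 for a single head is sketched below. Several points are assumptions made for illustration and are not confirmed by the text: the patch-level discriminator is modeled as a linear layer followed by a sigmoid that returns one probability per token, so the product of the query and key errors becomes an outer product of two vectors; all tokens of a video share one domain flag; and the gradient reversal layer and the multi-head concatenation of Equation 8 are omitted. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dde(p_source, is_source, eps=1e-12):
    """Eq. 9: log x for source tokens, log(1 - x) otherwise."""
    return np.log(p_source + eps) if is_source else np.log(1.0 - p_source + eps)

def dta_head(X, Wq, Wk, Wv, w_disc, is_source):
    """One head of Eq. 7, with a toy linear discriminator shared by Q and K."""
    q, k, v = X @ Wq, X @ Wk, X @ Wv              # each (n_tokens, d_h)
    e_q = dde(sigmoid(q @ w_disc), is_source)      # (n_tokens,) query errors
    e_k = dde(sigmoid(k @ w_disc), is_source)      # (n_tokens,) key errors
    scores = np.outer(e_q, e_k) / np.sqrt(q.shape[-1])   # transferability matrix
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_h = 4, 16, 8                  # toy sizes
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_h)) * 0.1 for _ in range(3))
w_disc = rng.normal(size=d_h)
out, weights = dta_head(X, Wq, Wk, Wv, w_disc, is_source=True)
```

Under this reading, the attention weights depend on the tokens only through the discriminator's per-token error, unlike Equation 6, where they depend on query-key similarity.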

Figure 3: DTAB overview. The Domain Transferable-guided Attention Block follows a standard transformer block layout, except for the new attention mechanism and the layer-wise information bottleneck calculation.

3.2.2 Information Bottleneck.

The Information Bottleneck (IB) principle has been widely adopted in self-supervised frameworks Kim et al. (2020); Park et al. (2020) to learn better feature representations. da Costa et al. (2022) proposed a novel method that employs the IB principle to align the domain features through contrastive learning. This method establishes the loss function shown in Equation 10, which uses a cross-correlation matrix defined as $C_{i,j} = \frac{\sum^{B}_{b} z_{i,b} \cdot z'_{j,b}}{\sqrt{\sum^{B}_{b}(z_{i,b})^2}\sqrt{\sum^{B}_{b}(z'_{j,b})^2}}$, computed between pairs of source representations $z^s_i$ and target features $z^t_j$ from a batch of $B$ over a total of $d$ features. da Costa et al. (2022) also propose a queue that stores recent source features and uses them to increase the number of pairs.

$$L_{ib}=\sum^{d}_{i=1}(1-C_{i,i})^{2}+\lambda\sum^{d}_{i=1}\sum^{d}_{j\neq i}(C_{i,j})^{2} \qquad (10)$$
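As a rough illustration of Equation 10, the following plain-Python sketch computes the cross-correlation matrix and the resulting loss for two small batches. The function names, the `eps` guard, and the default value of `lam` (standing in for $\lambda$) are assumptions made for this example and are not taken from the original implementation.

```python
import math


def cross_correlation(z, z_prime, eps=1e-12):
    """Cross-correlation matrix C (d x d) between two batches.

    z and z_prime are lists of B feature vectors, each of length d.
    eps is an assumption of this sketch: it only guards against a
    division by zero when a feature is zero across the whole batch.
    """
    d = len(z[0])
    # L2 norm of every feature dimension, taken over the batch.
    norm_z = [math.sqrt(sum(row[i] ** 2 for row in z)) for i in range(d)]
    norm_zp = [math.sqrt(sum(row[j] ** 2 for row in z_prime)) for j in range(d)]
    return [
        [
            sum(a[i] * b[j] for a, b in zip(z, z_prime))
            / (norm_z[i] * norm_zp[j] + eps)
            for j in range(d)
        ]
        for i in range(d)
    ]


def ib_loss(z, z_prime, lam=0.005):
    """Loss of Equation 10; lam is a hypothetical default for lambda."""
    c = cross_correlation(z, z_prime)
    d = len(c)
    # Diagonal term: pull C[i][i] towards 1.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Off-diagonal term: push C[i][j] (j != i) towards 0.
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if j != i)
    return on_diag + lam * off_diag
```

When both batches are identical and their feature dimensions are decorrelated, the matrix is close to the identity and the loss is close to zero; correlated feature dimensions are penalized through the off-diagonal term.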

3.2.3 Transformer Block.

Figure 3 shows our Domain Transferable-guided Attention Block (DTAB) that integrates our attention mechanism with the Queued Information Bottleneck da Costa et al. (2022). We maintain the transformer structure Vaswani et al. (2017), changing the MSA to MDTA and adding the information bottleneck loss after the last residual connection. In our work, the DTAB attention mechanism weights the spatiotemporal relation transferability between the frames. At the same time, the IB da Costa et al. (2022) integrates a loss that minimizes the discrepancy between the residual features from MDTA.

4 Experiments and Results

To assess the effectiveness of our method, we conducted several studies on multiple video domain adaptation datasets, comparing against classical and state-of-the-art methods.

4.1 Datasets

We compare our proposed method with the state of the art on three different benchmarks.

UCF $\leftrightarrow$ HMDB$_{full}$ Chen et al. (2019) is one of the most widely used benchmarks, containing a subset of videos from two public datasets, UCF101 Soomro et al. (2012) and HMDB51 Kuehne et al. (2011), for a total of 3209 videos and 12 classes.

Kinetics $\rightarrow$ Gameplay Chen et al. (2019) is a non-public dataset that contains a subset of videos from the well-known Kinetics-400 Kay et al. (2017) and a private gameplay dataset Chen et al. (2019). The dataset contains 49998 videos and 30 classes.

Kinetics $\rightarrow$ NEC-Drone Choi et al. (2020) is a public dataset that contains videos from Kinetics-600 Carreira et al. (2018) and NEC-Drone. The dataset contains 10118 videos and 7 classes. For a fair comparison, we used the cropped version of NEC-Drone da Costa et al. (2022), which focuses on the actors performing the action.

4.2 Experimental Setup

To extract features from the videos, we studied three different backbones, ResNet101 He et al. (2016), I3D Carreira and Zisserman (2017), and STAM Transformer Sharir et al. (2021), all of them pretrained. Our method depends on frame-level features, so for the 3D backbones, like STAM Sharir et al. (2021) and I3D Carreira and Zisserman (2017), we densely slide a temporal window of 16 frames along the videos: for a frame $k$ in the video, the window uses frames from $k-7$ to $k+8$, with zero-padding at the beginning and the end of the video.
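The dense sliding window described above can be sketched as follows. This is only an illustration: the helper names, the 0-based frame indexing, and the use of `None` to mark zero-padded positions are assumptions of this example rather than details of the released code.

```python
def window_indices(k, num_frames, before=7, after=8):
    """Return the 16 frame indices covering k-7 .. k+8 for frame k.

    Positions that fall outside the video are returned as None, meaning
    the corresponding slot would be filled with an all-zero frame.
    Frames are assumed to be indexed from 0 to num_frames - 1.
    """
    return [
        t if 0 <= t < num_frames else None
        for t in range(k - before, k + after + 1)
    ]


def build_window(frames, k, zero_frame):
    """Gather the window around frame k, using zero_frame as padding."""
    return [
        frames[t] if t is not None else zero_frame
        for t in window_indices(k, len(frames))
    ]
```

For the first frame of a video, the first seven slots of the window are padding; for a frame in the middle of a sufficiently long video, no padding is needed.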

Our transformer encoder comprises $4$ blocks, of which our Domain Transferable-guided Attention Block (DTAB) is the last, and every transformer block is composed of $h=8$ attention heads with a hidden size of $d_{model}=512$. We used the Adam optimizer with a weight decay of $5\cdot 10^{-4}$ and a learning rate of $3\cdot 10^{-5}$ for $300$ epochs. The sampling strategy divides each action video into $k$ segments and then randomly samples one frame from each segment. We present the adversarial training, DTAB hyperparameters, and sampling parameters in the supplementary material. The implementation is based on the PyTorch framework, and the source code is released here. For the sake of reproducibility, we also release the extracted features of the public datasets for each backbone.
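The segment-based sampling strategy can be illustrated with a short sketch. The function name, the integer segment boundaries, and the error raised for videos shorter than $k$ frames are assumptions of this example; the text does not specify how such cases are handled.

```python
import random


def sample_segment_frames(num_frames, k, rng=None):
    """Pick one random frame index from each of k contiguous segments.

    The video is split into k near-equal segments and one index is drawn
    uniformly from each. rng is an optional random.Random instance, which
    makes the sampling reproducible.
    """
    if k <= 0 or num_frames < k:
        # Assumption: every segment must contain at least one frame.
        raise ValueError("need at least one frame per segment")
    rng = rng if rng is not None else random.Random()
    indices = []
    for i in range(k):
        start = (i * num_frames) // k      # first frame of segment i
        end = ((i + 1) * num_frames) // k  # one past its last frame
        indices.append(rng.randrange(start, end))
    return indices
```

For instance, `sample_segment_frames(100, 5, random.Random(0))` returns five indices, one from each block of 20 consecutive frames, in increasing order.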

4.3 Comparisons to the State-of-the-art Methods

In this section, we evaluate our method on three different video domain adaptation datasets and compare it with different methods from the literature.

4.3.1 Results on UCF101 $\leftrightarrow$ HMDB51$_{full}$.

As shown in Table 1, we compare our results using three different backbones. The first one is ResNet101, a 2D backbone, with which our method achieves a significant average increase of $2.4\%$, reaching $88.2\%$. This accuracy surpasses some methods that use the I3D backbone, showing that even with a 2D backbone, our method can extract significant spatiotemporal information.

Among methods that use an I3D backbone, our method surpasses even those that use multi-modal data (RGB and Flow). Compared with single-modal methods, our method achieves a significant average increase of $3.5\%$; compared with multi-modal methods, the average increase is $0.4\%$. Finally, for the STAM Transformer, Table 1 shows that our method achieves an average increase of $0.9\%$.

Table 1: Classification Accuracy on UCF101 $\leftrightarrow$ HMDB51$_{full}$. Multi-modal methods are represented with (R + F).

| Method | Backbone | U $\rightarrow$ H | H $\rightarrow$ U | Average |
|---|---|---|---|---|
| Source Only | ResNet101 | 73.9 | 71.7 | 72.8 |
| TA$^{3}$ Chen et al. (2019) | ResNet101 | 78.3 | 81.8 | 80.1 |
| MA$^{2}$LT-D Chen et al. (2022) | ResNet101 | 85.0 | 86.6 | 85.8 |
| TransferAttn (ours) | ResNet101 | 88.1 | 88.3 | 88.2 |
| Source Only | TRN | 82.2 | 88.1 | 85.2 |
| STHC Li et al. (2023) | TRN | 90.9 | 92.1 | 91.5 |
| TranSVAE Wei et al. (2023) | TRN | 92.2 | 96.5 | 94.3 |
| TransferAttn (ours) | TRN | 93.5 | 97.1 | 95.3 |
| Source Only | I3D | 80.6 | 89.3 | 85.0 |
| STCDA (R + F) Song et al. (2021) | I3D | 83.1 | 92.1 | 87.7 |
| CycDA Lin et al. (2022) | I3D | 88.1 | 90.0 | 89.1 |
| MA$^{2}$LT-D Chen et al. (2022) | I3D | 89.3 | 91.2 | 90.3 |
| CoMix Sahoo et al. (2021) | I3D | 86.7 | 93.9 | 90.3 |
| CO$^{2}$ Turrisi da Costa et al. (2022) | I3D | 87.8 | 95.8 | 91.8 |
| CIA (R + F) Yang et al. (2022) | I3D | 91.9 | 94.6 | 93.3 |
| TranSVAE Wei et al. (2023) | I3D | 87.8 | 99.0 | 93.4 |
| MTRAN (R + F) Huang et al. (2022) | I3D | 92.2 | 95.3 | 93.8 |
| CleanAdapt (R + F) Dasgupta et al. (2023) | I3D | 93.6 | 99.3 | 96.5 |
| TransferAttn (ours) | I3D | 94.4 | 99.4 | 96.9 |
| Source Only | STAM | 86.9 | 93.7 | 90.3 |
| TranSVAE Wei et al. (2023) | STAM | 93.5 | 99.5 | 96.5 |
| UDAVT da Costa et al. (2022) | STAM | 92.3 | 96.8 | 94.6 |
| MA$^{2}$LT-D Chen et al. (2022) | STAM | 95.3 | 99.4 | 97.4 |
| TransferAttn (ours) | STAM | 97.2 | 99.7 | 98.5 |

4.3.2 Results on Kinetics $\rightarrow$ Gameplay.

We evaluate our method on the task Kinetics $\rightarrow$ Gameplay, as shown in Table 2. Because this dataset does not give access to the raw videos, only to the frame-level features from ResNet101, we could not provide results for methods whose backbone is not fixed during training or that rely on video-level features. As shown in Table 2, our method achieves a significant increase of $6.5\%$ in accuracy compared with the state-of-the-art MA$^{2}$LT-D and outperforms all prior methods.

Table 2: Classification Accuracy on Kinetics $\rightarrow$ Gameplay dataset.

| Method | Backbone | K $\rightarrow$ G |
|---|---|---|
| Source Only | ResNet101 | 17.6 |
| TA$^{3}$ Chen et al. (2019) | ResNet101 | 27.5 |
| MA$^{2}$LT-D Chen et al. (2022) | ResNet101 | 31.5 |
| TranSVAE Wei et al. (2023) | ResNet101 | 21.9 |
| TransferAttn (ours) | ResNet101 | 37.0 |

4.3.3 Results on Kinetics $\rightarrow$ NEC-Drone.

Our method was also evaluated on the Kinetics $\rightarrow$ NEC-Drone task. As shown in Table 3, it achieves a significant increase of $9.5\%$ over UDAVT, establishing a new state-of-the-art result.

Table 3: Classification Accuracy on Kinetics $\rightarrow$ NEC-Drone dataset.

| Method | Backbone | K $\rightarrow$ N |
|---|---|---|
| Source Only | STAM | 29.4 |
| MA$^{2}$LT-D Chen et al. (2022) | STAM | 55.4 |
| TranSVAE Wei et al. (2023) | STAM | 55.9 |
| UDAVT da Costa et al. (2022) | STAM | 65.3 |
| TransferAttn (ours) | STAM | 74.8 |
| # | Components | K $\rightarrow$ N |
|---|---|---|
| a) | Standard Transformer | 45.5 |
| b) | + MDTA | 54.8 |
| c) | + IB | 58.3 |
| d) | DTAB (MDTA + IB) | 74.8 |

Figure 4: Ablation study on Kinetics $\rightarrow$ NEC-Drone integrating each component of DTAB separately in comparison with standard transformer. Left: The t-SNE plots for class-wise features. Right: The accuracy result of each component.

4.4 Ablation Study

This subsection presents ablations of the main design choices of our method and the effect of each loss function, and shows how our new transformer block (DTAB) can improve the results of other transformer-based UDA methods for action recognition and image classification.

4.4.1 Effect of DTAB components.

To measure the individual contribution of each component of our Domain Transferable-guided Attention Block (DTAB) to domain transferability, we evaluate how each component impacts the final result. Figure 4 summarizes the results, showing the accuracy on Kinetics $\rightarrow$ NEC-Drone and the t-SNE visualization.

Replacing the self-attention mechanism (MSA) with our Domain Transferable Attention (MDTA) in the last transformer block increases the accuracy by $9.3\%$, whereas using only the IB mechanism in the last transformer block, with standard self-attention, increases the accuracy by $12.8\%$. Finally, integrating all the components yields a significant increase of $29.3\%$, indicating the importance of both components in reducing the domain gap. The t-SNE plot also shows how each component affects the clustering of the low-dimensional features from different classes: combining all the components clusters features from the same class more tightly while pushing different classes farther apart.

4.4.2 Effect of integrating each loss component.

We conduct an ablation study by integrating each loss function, $\mathcal{L}^{s}_{cls}$, $\mathcal{L}^{t}_{H}$, $\mathcal{L}_{adv}$, and $\mathcal{L}_{ib}$. As seen in Table 4, each added component improves the domain alignment, reducing the domain gap, and the best result is achieved when all components are used.

Table 4: Loss integration studies with accuracy on Kinetics $\rightarrow$ NEC-Drone dataset.
$\mathcal{L}^{s}_{cls}$ $\mathcal{L}^{t}_{H}$ $\mathcal{L}_{adv}$ $\mathcal{L}_{ib}$ K $\rightarrow$ N
21.6
27.9
54.8
64.9
74.8

4.4.3 Effect of DTAB on other transformer VUDA methods.

Another important ablation is the effect of our new transformer block, DTAB, on other transformer VUDA methods. For a fair comparison, we also studied the effect of the TVT Yang et al. (2023) attention mechanism (TAM). For this study, we selected UDAVT and the task UCF $\rightarrow$ HMDB$_{full}$, and we report the results obtained by our reproduction using the authors' code (https://github.com/vturrisi/UDAVT). The results in Table 5 show that adding our new transformer block increased the average accuracy by $1.7\%$, while the TVT Yang et al. (2023) attention mechanism (TAM) yielded only a slight improvement. This result shows that our DTAB module can be integrated with other transformer methods for action recognition domain adaptation, helps reduce the domain gap, and deals better with spatiotemporal information than the TVT mechanism. Because MTRAN Huang et al. (2022) is source-free, we could not evaluate DTAB on it in this ablation.

Table 5: Classification accuracy in HMDB-UCF$_{full}$ dataset integrating DTAB on state-of-the-art transformer video unsupervised domain adaptation architectures

| Method | Backbone | U $\rightarrow$ H | H $\rightarrow$ U | Average |
|---|---|---|---|---|
| UDAVT (Our implementation) da Costa et al. (2022) | STAM | 92.2 | 96.5 | 94.4 |
| UDAVT + TAM Yang et al. (2023) | STAM | 92.5 | 96.9 | 94.7 |
| UDAVT + DTAB | STAM | 94.2 | 97.9 | 96.1 |

4.4.4 Effect of DTAB on image transformer UDA methods.

We also conducted experiments adapting the TVT and CDTrans methods to use our DTAB module. The results reported in Table 6 were obtained through our reproduction using the authors' code (https://github.com/uta-smile/TVT/, https://github.com/CDTrans/CDTrans/) on the Office-31 dataset. The results show an increase of $1.1\%$ in the average accuracy of TVT Yang et al. (2023), while for CDTrans Xu et al. (2021) we observe an increase of $0.8\%$.

Table 6: Classification accuracy in Office-31 dataset integrating DTAB on state-of-the-art image unsupervised domain adaptation transformer architectures

| Method | Backbone | A $\rightarrow$ W | D $\rightarrow$ W | A $\rightarrow$ D | D $\rightarrow$ A | W $\rightarrow$ A | Avg. |
|---|---|---|---|---|---|---|---|
| TVT (Our impl.) Yang et al. (2023) | ViT-B 16 | 93.6 | 98.7 | 93.7 | 79.4 | 78.7 | 88.8 |
| TVT + DTAB | ViT-B 16 | 94.7 | 99.9 | 95.0 | 80.3 | 79.5 | 89.9 |
| CDTrans (Our impl.) Xu et al. (2021) | DeiT-S | 93.5 | 98.2 | 94.0 | 77.7 | 77.0 | 88.1 |
| CDTrans + DTAB | DeiT-S | 94.7 | 98.4 | 95.6 | 77.7 | 78.1 | 88.9 |

4.4.5 Complexity Analysis.

We also conduct a complexity analysis of our TransferAttn architecture. The study compares some baseline methods regarding the number of trainable parameters (#Parameters) and the floating point operations (GFLOPs). We included the I3D backbone FLOPs for all the reported methods except UDAVT, which strictly uses STAM as the backbone. As shown in Table 7, our method has slightly more trainable parameters than TranSVAE, which is compensated by the significant improvement in domain adaptation reported in Section 4.3. Compared with other transformer methods, like UDAVT, our method has substantially fewer trainable parameters and floating point operations while delivering a significant improvement in domain adaptation, indicating the efficiency of our transformer architecture.

Table 7: Comparison on Model Complexity

| Methods | #Parameters | GFLOPs |
|---|---|---|
| CoMix | 30.4 M | 37.1 |
| TranSVAE | 12.7 M | 36.5 |
| MA$^{2}$LT-D | 49.7 M | 51.2 |
| UDAVT | 118.9 M | 524.7 |
| TransferAttn (ours) | 16.3 M | 55.3 |

5 Conclusion and Limitations

Unsupervised Domain Adaptation (UDA) has been widely explored for images. However, there are still few works for videos, since aligning the temporal features can be challenging. There are even fewer works that exploit transformer architectures for video UDA, a promising strategy due to their performance in other tasks. To address this, we proposed the Transferable-guided Attention (TransferAttn), a framework for video UDA and one of the few works that exploit transformer architectures to adapt cross-domain knowledge. We also proposed a novel Domain Transferable-guided Attention Block (DTAB) that employs attention mechanisms to encourage spatial-temporal transferability between frames. We evaluated our method on three benchmarks: UCF $\leftrightarrow$ HMDB$_{full}$, Kinetics $\rightarrow$ Gameplay, and Kinetics $\rightarrow$ NEC-Drone. We outperformed all the state-of-the-art methods we compared against, showing the effectiveness of our proposed strategy. Our DTAB block also proved to be a promising strategy by itself: when added to other state-of-the-art UDA frameworks, it increased their performance.

We further discuss the limitations of the TransferAttn method. In future work, we intend to evaluate our framework on other video tasks, such as video segmentation and action localization. Furthermore, we plan to extend our method to utilize multi-modal data and to make it compatible with source-free methods. Lastly, we intend to explore newer video transformer backbones.

References

  • da Costa et al. [2022] V. da Costa, G. Zara, P. Rota, T. Oliveira-Santos, N. Sebe, V. Murino, and E. Ricci. Unsupervised domain adaptation for video transformers in action recognition. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 1258–1265, Los Alamitos, CA, USA, aug 2022. IEEE Computer Society. doi:10.1109/ICPR56361.2022.9956679. URL https://doi.ieeecomputersociety.org/10.1109/ICPR56361.2022.9956679.
  • Kong and Fu [2022] Yu Kong and Yun Fu. Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5):1366–1401, 2022.
  • Huang et al. [2018] D-A. Huang, V. Ramanathan, D. Mahajan, L. Torresani, M. Paluri, L. Fei-Fei, and J. C. Niebles. What makes a video a video: Analyzing temporal information in video understanding models and datasets. In CVPR, pages 7366–7375, Salt Lake City, UT, USA, June 18–22 2018. IEEE Computer Society.
  • Wang et al. [2017] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2017. doi:10.1109/TCSVT.2016.2589879.
  • Chen et al. [2022] Peipeng Chen, Yuan Gao, and Andy J. Ma. Multi-level attentive adversarial learning with temporal dilation for unsupervised video domain adaptation. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 776–785, 2022. doi:10.1109/WACV51458.2022.00085.
  • Ganin et al. [2017] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-Adversarial Training of Neural Networks, pages 189–209. Springer International Publishing, Cham, 2017. ISBN 978-3-319-58347-1. doi:10.1007/978-3-319-58347-1_10. URL https://doi.org/10.1007/978-3-319-58347-1_10.
  • Tzeng et al. [2017] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2962–2971, 2017. doi:10.1109/CVPR.2017.316.
  • Ganin and Lempitsky [2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 1180–1189. JMLR.org, 2015.
  • Ghifary et al. [2014] Muhammad Ghifary, W. Bastiaan Kleijn, and Mengjie Zhang. Domain adaptive neural networks for object recognition. In Duc-Nghia Pham and Seong-Bae Park, editors, PRICAI 2014: Trends in Artificial Intelligence, pages 898–904, Cham, 2014. Springer International Publishing. ISBN 978-3-319-13560-1.
  • Long et al. [2017] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 2208–2217. JMLR.org, 2017.
  • Xu et al. [2021] Tongkun Xu, Weihua Chen, Pichao Wang, Fan Wang, Hao Li, and Rong Jin. Cdtrans: Cross-domain transformer for unsupervised domain adaptation. CoRR, abs/2109.06165, 2021. URL https://arxiv.org/abs/2109.06165.
  • Yang et al. [2023] Jinyu Yang, Jingjing Liu, Ning Xu, and Junzhou Huang. Tvt: Transferable vision transformer for unsupervised domain adaptation. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 520–530, 2023. doi:10.1109/WACV56688.2023.00059.
  • Munro and Damen [2020] Jonathan Munro and Dima Damen. Multi-modal Domain Adaptation for Fine-grained Action Recognition. In Computer Vision and Pattern Recognition (CVPR), 2020.
  • Chen et al. [2019] Min-Hung Chen, Zsolt Kira, Ghassan Alregib, Jaekwon Yoo, Ruxin Chen, and Jian Zheng. Temporal attentive alignment for large-scale video domain adaptation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6320–6329, 2019. doi:10.1109/ICCV.2019.00642.
  • Yin et al. [2022] Yuehao Yin, Bin Zhu, Jingjing Chen, Lechao Cheng, and Yu-Gang Jiang. Mix-dann and dynamic-modal-distillation for video domain adaptation. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, page 3224–3233, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392037. doi:10.1145/3503161.3548313. URL https://doi.org/10.1145/3503161.3548313.
  • Wei et al. [2023] Pengfei Wei, Lingdong Kong, Xinghua Qu, Yi Ren, Zhiqiang Xu, Jing Jiang, and Xiang Yin. Unsupervised video domain adaptation for action recognition: A disentanglement perspective. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Rp4PA0ez0m.
  • Turrisi da Costa et al. [2022] Victor G. Turrisi da Costa, Giacomo Zara, Paolo Rota, Thiago Oliveira-Santos, Nicu Sebe, Vittorio Murino, and Elisa Ricci. Dual-head contrastive domain adaptation for video action recognition. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2234–2243, 2022. doi:10.1109/WACV51458.2022.00229.
  • Dasgupta et al. [2023] Avijit Dasgupta, C.V. Jawahar, and Karteek Alahari. Overcoming label noise for source-free unsupervised video domain adaptation. In Proceedings of the Thirteenth Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP ’22, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450398220. doi:10.1145/3571600.3571621. URL https://doi.org/10.1145/3571600.3571621.
  • Huang et al. [2022] Yi Huang, Xiaoshan Yang, Ji Zhang, and Changsheng Xu. Relative alignment network for source-free multimodal video domain adaptation. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, page 1652–1660, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392037. doi:10.1145/3503161.3548009. URL https://doi.org/10.1145/3503161.3548009.
  • Li et al. [2023] K. Li, D. Patel, E. Kruus, and M. Min. Source-free video domain adaptation with spatial-temporal-historical consistency learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14643–14652, Los Alamitos, CA, USA, jun 2023. IEEE Computer Society. doi:10.1109/CVPR52729.2023.01407. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52729.2023.01407.
  • Choi et al. [2020] Jinwoo Choi, Gaurav Sharma, Manmohan Chandraker, and Jia-Bin Huang. Unsupervised and semi-supervised domain adaptation for action recognition from drones. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1706–1715, 2020. doi:10.1109/WACV45572.2020.9093511.
  • Ji et al. [2013] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013. doi:10.1109/TPAMI.2012.59.
  • Deng et al. [2023] Andong Deng, Taojiannan Yang, and Chen Chen. A large-scale study of spatiotemporal representation learning with a new benchmark on action recognition. arXiv preprint arXiv:2303.13505, 2023.
  • Tran et al. [2015] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, Los Alamitos, CA, USA, dec 2015. IEEE Computer Society. doi:10.1109/ICCV.2015.510. URL https://doi.ieeecomputersociety.org/10.1109/ICCV.2015.510.
  • Carreira and Zisserman [2017] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • Wang et al. [2018] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence, 41(11):2740–2755, 2018.
  • Wang et al. [2021] Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. Tdn: Temporal difference networks for efficient action recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1895–1904, 2021.
  • Donahue et al. [2015] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
  • Yue-Hei Ng et al. [2015] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.
  • Wu et al. [2015] Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, and Xiangyang Xue. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In ACM international conference on Multimedia, pages 461–470, 2015.
  • Ke et al. [2017] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. A new representation of skeleton sequences for 3d action recognition. In IEEE conference on computer vision and pattern recognition, pages 3288–3297, 2017.
  • Shahroudy et al. [2016] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016.
  • Zhu et al. [2016] Wentao Zhu, Cuiling Lan, Junliang Xing, Wenjun Zeng, Yanghao Li, Li Shen, and Xiaohui Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In AAAI conference on artificial intelligence, volume 30, 2016.
  • Liu et al. [2016] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 816–833. Springer, 2016.
  • Yan et al. [2018] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI conference on artificial intelligence, volume 32, 2018.
  • Si et al. [2019] Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, and Tieniu Tan. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In IEEE/CVF conference on computer vision and pattern recognition, pages 1227–1236, 2019.
  • Kim et al. [2024] Kiyoon Kim, Shreyank N Gowda, Panagiotis Eustratiadis, Antreas Antoniou, and Robert B Fisher. Adversarial augmentation training makes action recognition models more robust to realistic video distribution shifts, 2024.
  • Lai et al. [2024] Zhengfeng Lai, Haoping Bai, Haotian Zhang, Xianzhi Du, Jiulong Shan, Yinfei Yang, Chen-Nee Chuah, and Meng Cao. Empowering unsupervised domain adaptation with large-scale pre-trained vision-language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2691–2701, 2024.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
  • Sharir et al. [2021] Gilad Sharir, Asaf Noy, and Lihi Zelnik-Manor. An image is worth 16x16 words, what is a video worth?, 2021.
  • Arnab et al. [2021] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In International Conference on Computer Vision (ICCV), 2021.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Kim et al. [2020] Donghyun Kim, Kuniaki Saito, Tae-Hyun Oh, Bryan A. Plummer, Stan Sclaroff, and Kate Saenko. Cross-domain Self-supervised Learning for Domain Adaptation with Few Source Labels. arXiv e-prints, art. arXiv:2003.08264, March 2020. doi:10.48550/arXiv.2003.08264.
  • Park et al. [2020] Changhwa Park, Jonghyun Lee, Jaeyoon Yoo, Minhoe Hur, and Sungroh Yoon. Joint Contrastive Learning for Unsupervised Domain Adaptation. arXiv e-prints, art. arXiv:2006.10297, June 2020. doi:10.48550/arXiv.2006.10297.
  • Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012. URL http://arxiv.org/abs/1212.0402.
  • Kuehne et al. [2011] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563, 2011. doi:10.1109/ICCV.2011.6126543.
  • Kay et al. [2017] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017.
  • Carreira et al. [2018] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600, 2018.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • Song et al. [2021] Xiaolin Song, Sicheng Zhao, Jingyu Yang, Huanjing Yue, Pengfei Xu, Runbo Hu, and Hua Chai. Spatio-temporal contrastive domain adaptation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9787–9795, June 2021.
  • Lin et al. [2022] Wei Lin, Anna Kukleva, Kunyang Sun, Horst Possegger, Hilde Kuehne, and Horst Bischof. Cycda: Unsupervised cycle domain adaptation to learn from image to video. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 698–715, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-20062-5.
  • Sahoo et al. [2021] Aadarsh Sahoo, Rutav Shah, Rameswar Panda, Kate Saenko, and Abir Das. Contrast and mix: Temporal contrastive video domain adaptation with background mixing. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 23386–23400, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/c47e93742387750baba2e238558fa12d-Abstract.html.
  • Yang et al. [2022] Lijin Yang, Yifei Huang, Yusuke Sugano, and Yoichi Sato. Interact before align: Leveraging cross-modal knowledge for domain adaptive action recognition. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14702–14712, 2022. doi:10.1109/CVPR52688.2022.01431.

Appendix A Experiment Details

Adversarial training requires the parameter $\lambda$, which controls the weight of the Gradient Reversal Layer (GRL) on the adversarial head. Our DTAB module requires the parameters $Q$ and $\alpha$: the first controls the queue size, and the second controls the weight of the IB loss da Costa et al. [2022]. For the GRL inside DTAB, we fixed the weight to $\lambda_{DTAB}=1$ for every adaptation task. We must also define the batch size and the number $k$ of sampled frames, which are related to the training schedule.

In the domain adaptation task UCF $\rightarrow$ HMDB Chen et al. [2019], we used a batch size of $32$ and sampled $k=53$ frames. Because the dataset is smaller, we used a queue size of $Q=1024$. We also used an IB loss weight of $\alpha=0.001$ and an adversarial loss weight of $\lambda=1$. For the task HMDB $\rightarrow$ UCF, the only change is the adversarial loss weight, which we set to $\lambda=0.5$ to make training more stable.

In the adaptation task Kinetics $\rightarrow$ Gameplay Chen et al. [2019], we used a batch size of $64$ and sampled $k=23$ frames. In this task, we reduced the queue size to $Q=512$, used an IB loss weight of $\alpha=0.001$, and used a smaller adversarial loss weight of $\lambda=0.05$ to make training more stable.

In the task Kinetics $\rightarrow$ NEC-Drone Choi et al. [2020], we used a batch size of $64$, sampled $k=53$ frames, and used a queue size of $Q=512$. The IB loss weight is $\alpha=0.025$ and the adversarial loss weight is $\lambda=0.5$.
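
For quick reference, the per-task settings listed above can be gathered into a single configuration table. The following is only a minimal sketch: the class name `AdaptationConfig`, its field names, and the task keys are hypothetical and are not taken from the released code; only the numeric values come from the text of this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptationConfig:
    """Hypothetical container for the hyperparameters named in Appendix A."""
    batch_size: int
    num_frames: int               # k: number of sampled frames
    queue_size: int               # Q: queue size used by DTAB
    ib_weight: float              # alpha: weight of the IB loss
    adv_weight: float             # lambda: GRL weight on the adversarial head
    dtab_grl_weight: float = 1.0  # lambda_DTAB: fixed to 1 for every task


# Values copied from the appendix text; the dictionary keys are illustrative.
CONFIGS = {
    "ucf->hmdb": AdaptationConfig(32, 53, 1024, 0.001, 1.0),
    "hmdb->ucf": AdaptationConfig(32, 53, 1024, 0.001, 0.5),
    "kinetics->gameplay": AdaptationConfig(64, 23, 512, 0.001, 0.05),
    "kinetics->nec-drone": AdaptationConfig(64, 53, 512, 0.025, 0.5),
}

if __name__ == "__main__":
    for task, cfg in CONFIGS.items():
        print(task, cfg)
```

Keeping $\lambda_{DTAB}$ as a default field reflects that it is held constant across tasks, while the remaining fields vary per adaptation setting.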

Appendix B More Ablation Studies

This section reports additional ablation studies conducted with the TransferAttn framework.

B.1 Effect of DTAB position.

To study the impact of the position of the DTAB module, we experimented with replacing all transformer blocks with DTAB, replacing only the first or only the last block, and replacing the blocks in even or odd positions. The results in Table 8 show that DTAB works best when it replaces the last transformer block, where the patch features are more fine-grained than in the other blocks.

Table 8: Ablation study on Kinetics $\rightarrow$ NEC-Drone integrating the DTAB in different encoder positions.

| DTAB Position | Backbone | K $\rightarrow$ N |
| --- | --- | --- |
| All Blocks | STAM | 36.0 |
| First Only | STAM | 38.2 |
| Even Positions | STAM | 54.0 |
| Odd Positions | STAM | 65.4 |
| Last Only | STAM | 74.8 |
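
The placement strategies compared in Table 8 can be illustrated with a small helper that lists which encoder blocks would be replaced by DTAB. This is a sketch under stated assumptions, not the authors' implementation: the function name `dtab_positions`, the strategy labels, the 1-indexed counting of "even" and "odd" positions, and the encoder depth used in the example are all hypothetical.

```python
from typing import List


def dtab_positions(num_blocks: int, strategy: str) -> List[int]:
    """Return the 1-indexed encoder positions that would use DTAB.

    Assumption: positions are counted from 1, so "odd" means blocks
    1, 3, 5, ... and "even" means blocks 2, 4, 6, ...
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    positions = range(1, num_blocks + 1)
    if strategy == "all":
        return list(positions)
    if strategy == "first":
        return [1]
    if strategy == "last":
        return [num_blocks]
    if strategy == "even":
        return [p for p in positions if p % 2 == 0]
    if strategy == "odd":
        return [p for p in positions if p % 2 == 1]
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    # The depth of 6 blocks is an illustrative value only.
    for name in ("all", "first", "even", "odd", "last"):
        print(name, dtab_positions(6, name))
```

Under this counting, the "Last Only" setting that performs best in Table 8 corresponds to `dtab_positions(num_blocks, "last")`, which leaves every earlier block as a standard transformer block.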