TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation

André Sacilotti
Institute of Mathematics and Computer Science
University of São Paulo

Samuel Felipe dos Santos
Dept. of Computing
Federal University of São Carlos

Nicu Sebe
Dept. of Information Engineering and Computer Science
University of Trento

Jurandy Almeida
Dept. of Computing
Federal University of São Carlos

Abstract

Unsupervised domain adaptation (UDA) in videos is a challenging task that remains underexplored compared to image-based UDA. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has received little attention. Our key idea is to use the transformer layers as a feature encoder and incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge from different backbones. To improve the transferability of ViT, we introduce a novel and effective module named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by changing the self-attention mechanism to a transferability attention mechanism. Extensive experiments on the UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC Drone datasets with different backbones, like ResNet101, I3D, and STAM, verify the effectiveness of TransferAttn compared with state-of-the-art approaches. We also demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both the video and image domains. The code will be made freely available.

Keywords Action Recognition $\cdot$ Unsupervised Domain Adaptation $\cdot$ Adversarial Domain Adaptation

1 Introduction

With the popularization of social media platforms focused on user-generated content, a huge volume of data is generated. For instance, 720,000 hours of video content are uploaded to YouTube daily (https://www.demandsage.com/youtube-stats/, as of February 29, 2024). Cataloging and searching this content is necessary. However, manually analyzing such an immense amount of content is practically impossible, making video analysis tasks crucial.

Among the several video analysis tasks, action recognition is one of the most popular and challenging, since there are many variations in how an action can be carried out and captured, for example, speed, duration, camera and actor movement, and occlusion da Costa et al. (2022).

Various deep learning methods for action recognition are available in the literature. These approaches can be classified based on how they handle the temporal dimension. Some use 3D models to capture spatial and temporal features, while others treat spatial and temporal data separately or employ Recurrent Neural Networks (RNNs) to model the temporal dynamics Kong and Fu (2022). Despite all the advances, the temporal structure of videos still poses some challenges for training deep learning models Huang et al. (2018). Human costs are high, as many video annotations are needed to yield good results. Obtaining and annotating a desirable amount of data is difficult for many application domains, requiring significant human effort and specific knowledge Wang et al. (2017).

Unsupervised Domain Adaptation (UDA) can be used to reduce the cost of manually annotating data. In these strategies, the model is trained with labeled data from a source domain and unlabeled data from a target domain to perform well on the target domain's test set. Because the source and target domains might have different backgrounds, illumination, camera positions, etc., UDA methods must deal with the distribution mismatch generated by this domain gap Chen et al. (2022). Several works have been proposed in the literature to address this issue, e.g., adversarial-based methods Ganin et al. (2017); Tzeng et al. (2017); Ganin and Lempitsky (2015), metric-based methods Ghifary et al. (2014); Long et al. (2017), and more recently, transformer-based methods Xu et al. (2021); Yang et al. (2023), achieving remarkable results. However, these methods target image UDA. Video UDA is considerably less explored and more challenging, as it requires handling the temporal aspects of the data da Costa et al. (2022).

Only a few recent works Munro and Damen (2020); Chen et al. (2019); Yin et al. (2022); Wei et al. (2023); Turrisi da Costa et al. (2022); Dasgupta et al. (2023); Chen et al. (2022); Huang et al. (2022); da Costa et al. (2022); Li et al. (2023) tackle video UDA for action recognition using deep learning with strategies like contrastive learning, cross-domain attention mechanisms, self-supervised learning, and multiple data modalities. Even fewer works da Costa et al. (2022); Huang et al. (2022) explore transformer architectures.

In this work, we propose a novel method for video UDA in action recognition, Transferable-guided Attention (TransferAttn), which shows the potential of the transformer architecture. Our method uses pre-trained frozen backbones to extract frame-by-frame features of the videos. A transformer encoder is used to reduce the domain gap and learn temporal relationships between frames. The encoder also includes our proposed transformer block, named Domain Transferable-guided Attention Block (DTAB), which introduces a new attention mechanism. Finally, we use two heads: a classification head for action recognition and an adaptation head that employs adversarial learning.

We evaluate our approach on three well-known video UDA benchmarks for action recognition, UCF $\leftrightarrow$ HMDB_full Chen et al. (2019), Kinetics $\rightarrow$ Gameplay Chen et al. (2019), and Kinetics $\rightarrow$ NEC-Drone Choi et al. (2020), where we outperform the other state-of-the-art methods. We also integrated our proposed DTAB module into other state-of-the-art transformer architectures for UDA and observed that it increases their performance.

The main contributions of this paper are summarized as follows:

• To the best of our knowledge, we are the first to present a backbone-independent transformer architecture for video UDA. Our empirical experiments showed the effectiveness of the transformer encoder in extracting fine-grained spatio-temporal transferable representations.

• We propose DTAB, a novel transferable transformer block for UDA. Our method employs a new attention mechanism that improves adaptation and domain transferability. We also show the positive effect of applying the DTAB module to other state-of-the-art UDA methods for videos and images.

• We conduct extensive experiments on several benchmarks, setting a new state-of-the-art result on three different cross-domain datasets. In addition, our ablation study demonstrates the positive effect of each part of our method.

2 Related Work

2.1 Video-based Action Recognition.

Action recognition methods have been extensively studied with the advent of deep learning, especially with the introduction of large-scale video datasets, such as Kinetics, Moments-In-Time, YouTube Sports 1M, and YouTube 8M Ji et al. (2013). BEAR Deng et al. (2023) introduces a new benchmark in action recognition, which is designed to cover a diverse set of real-world applications. Deep learning CNN models can be divided into three categories according to how they model the temporal dimension Kong and Fu (2022): (1) space-time networks, (2) multi-stream networks, and (3) hybrid models. Space-time networks use 3D convolutions to maintain temporal information, inflating 2D kernels to 3D, like C3D Tran et al. (2015) and I3D Carreira and Zisserman (2017). Multi-stream networks employ different models to deal with spatial information (usually RGB images) and motion information (usually optical flow), such as TSN Wang et al. (2018), which applies temporal sampling, and TDN Wang et al. (2021), which has modules to capture short-term and long-term (across segments) motion. Hybrid models integrate recurrent networks, like LSTMs Donahue et al. (2015); Yue-Hei Ng et al. (2015); Wu et al. (2015), and Temporal CNNs Ke et al. (2017) on top of the CNNs. Skeleton data, like body joint information, can also be utilized Shahroudy et al. (2016); Zhu et al. (2016); Liu et al. (2016); Ke et al. (2017), and recent works Yan et al. (2018); Si et al. (2019) show that graph convolution outperforms RNNs and Temporal CNNs at capturing information from joints Kong and Fu (2022). Kim et al. (2024) present a novel training approach to make models robust to distribution shifts.

2.2 Unsupervised Domain Adaptation.

Unsupervised domain adaptation (UDA) in the image domain has a wide range of strategies to address the domain shift. A standard option is adversarial-based methods Ganin et al. (2017); Tzeng et al. (2017); Ganin and Lempitsky (2015); Lai et al. (2024), which use a domain discriminator and train the feature extractor to fool it through a min-max optimization game, similar to the training of Generative Adversarial Networks (GAN) Goodfellow et al. (2014), thereby minimizing the domain gap. In addition, metric-based methods aim to reduce the domain gap by learning domain-invariant features through discrepancy metrics, like Maximum Mean Discrepancy (MMD) Ghifary et al. (2014) and Joint Adaptation Networks (JAN) Long et al. (2017), which incorporate a loss that measures the discrepancy between the domain features and minimize it to reduce the domain shift. Driven by the success of Vision Transformers, CDTrans Xu et al. (2021) adopts a three-branch cross transformer that proves to be robust to noise. In turn, TVT Yang et al. (2023) injects a transferability metric as a weight into the class token attention. Although TVT shows great results by injecting transferability into the class token weight, it lacks two essential points: i) it does not use spatial relation transferability; ii) as an image UDA method, it does not incorporate temporal relation transferability.

2.3 Unsupervised Domain Adaptation for Action Recognition.

Although there are several possible applications of UDA for action recognition in real-world problems, only a limited number of recent studies have tackled this challenging task Munro and Damen (2020); Chen et al. (2019); Yin et al. (2022); Wei et al. (2023); Turrisi da Costa et al. (2022); Dasgupta et al. (2023); Chen et al. (2022); Li et al. (2023). TA³N Chen et al. (2019) proposes a domain attention mechanism that focuses on the temporal dynamics of the videos. MA²LT-D Chen et al. (2022) generates multi-level temporal features with multiple domain discriminators. Level-wise attention weights are calculated by domain confusion and features are aggregated by attention determined by the domain discriminators. Other approaches use multiple modalities of data, like MM-SADA Munro and Damen (2020), where self-supervision among modalities is used, and MixDANN Yin et al. (2022), which dynamically estimates the most adaptable modality and uses it as a teacher to the others.

CleanAdapt Dasgupta et al. (2023) tackles the source-free video domain adaptation problem using a model pre-trained on the source domain to generate noisy labels for the target domain, and the labels that are likely correct are used to fine-tune the model. STHC Li et al. (2023) tackles the source-free setting using spatial and temporal augmentation. In a different approach, TranSVAE Wei et al. (2023) handles spatial and temporal domain divergence separately by constraining different sets of latent factors.

Although transformers can obtain state-of-the-art performance, only a few transformer-based works for video UDA exist. UDAVT da Costa et al. (2022) is a recent work that leverages the STAM visual transformer Sharir et al. (2021) and proposes a domain alignment loss based on the Information Bottleneck (IB) principle to learn domain-invariant features. MTRAN Huang et al. (2022), which depends on 3D backbones, uses a transformer layer inspired by ViViT Arnab et al. (2021), where each token is a 16-frame clip representation. Although UDAVT and MTRAN show great results incorporating the transformer mechanism, the UDAVT architecture strictly depends on transformer backbones that deal separately with spatial and temporal relations, like STAM Sharir et al. (2021). MTRAN, in turn, depends on 3D backbones, and its attention operates on clip-level pooled features, lacking a more fine-grained frame-level relation. Moreover, neither of them explores ways to improve knowledge transfer in the transformer mechanism.

3 Our Approach

Figure 1 shows a simplified overview of our method. In Section 3.1, we first discuss the preliminaries and background on adversarial unsupervised domain adaptation and transformers, and in Section 3.2, we then detail our domain transferable-guided attention block, called DTAB, and its components.

3.1 Preliminaries and Background

Figure 1: TransferAttn overview. The input video frames are fed into a fixed Backbone to extract frame-by-frame features, followed by a Clip Embedding to map frames into tokens. The embeddings are fed into a sequence of transformers to extract relevant transferable spatiotemporal information. The adaptation branch for adversarial domain discrimination uses fine-grained representations from the transformer encoder.

3.1.1 Network Overview.

The overall architecture consists of five components: the backbone, patch embedding, encoder, classification head, and adversarial head, as shown in Figure 1. Given $n_{s}$ labeled videos in the source domain and $n_{t}$ unlabeled videos in the target domain, with $k$ sampled frames each, we define the $j$-th frame from the $i$-th video as $x^{s}_{i,j}$ for the source domain and $x^{t}_{i,j}$ for the target domain.

The backbone in our method ($\mathcal{G}_{b}$) is fixed and not trained. The patch embedding ($\mathcal{G}_{p}$) is an MLP that maps the feature from the backbone to the transformer encoder input size. The transformer encoder ($\mathcal{G}_{e}$) comprises $L-1$ transformer layers and one DTAB module, with $h$ attention heads and a hidden size of $d$.

In the adversarial head, the Gradient Reversal Layer (GRL), $\mathcal{G}_{grl}$, inverts the gradients with weight $\lambda$, resulting in a min-max optimization. The discriminator, $\mathcal{G}_{D}$, then tries to discriminate whether the video originates from the source or the target domain. The classification head contains a classifier, $\mathcal{G}_{C}$. Unlike the discriminator, the classifier MLP is not trained. This choice makes the model more robust to noisy labels from the source domain, avoids a projection that can overfit to the source domain features, and makes learning to discriminate action classes the encoder's responsibility.

For convenience, we refer to the extracted features for the $j$-th frame of the $i$-th video as $F^{s}_{i,j}$ for the source domain and $F^{t}_{i,j}$ for the target domain.

3.1.2 Transformer Encoder.

This work aims to train our Transformer Encoder to align the data distribution, reduce the domain gap, and improve the temporal relation information between the frames. Our method does not rely on the CLS token for the transformer encoder. Instead, we apply Global Average Pooling (GAP) over the patches. For convenience, we define the features from the Transformer Encoder as $f_{i}^{s}$ and $f_{i}^{t}$ for the source and target domains, respectively, as shown in Equations 1 and 2:

$$f_{i}^{s}=GAP(\mathcal{G}_{e}(\mathcal{G}_{p}(F_{i,1}^{s},F_{i,2}^{s},\ldots,F_{i,k}^{s}))) \tag{1}$$

$$f_{i}^{t}=GAP(\mathcal{G}_{e}(\mathcal{G}_{p}(F_{i,1}^{t},F_{i,2}^{t},\ldots,F_{i,k}^{t}))) \tag{2}$$
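The composition in Equations 1 and 2 can be illustrated with a minimal Python sketch. The identity functions standing in for $\mathcal{G}_{p}$ and $\mathcal{G}_{e}$, as well as the names `video_feature` and `global_average_pool`, are hypothetical placeholders for illustration and do not reflect the trained modules.

```python
def global_average_pool(tokens):
    """Average k token vectors (one per sampled frame) into one video-level vector."""
    k = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[j] for tok in tokens) / k for j in range(dim)]


def video_feature(frame_features, patch_embed, encoder):
    """f_i = GAP(G_e(G_p(F_{i,1}, ..., F_{i,k}))) for a single video."""
    tokens = [patch_embed(f) for f in frame_features]  # G_p, applied per frame
    encoded = encoder(tokens)                           # G_e, over the whole sequence
    return global_average_pool(encoded)                 # pooled feature, no CLS token


# Hypothetical stand-ins: identity maps instead of the trained MLP and transformer.
def identity_embed(feature):
    return list(feature)


def identity_encoder(tokens):
    return tokens


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # k = 3 frames, 2-dimensional features
f_i = video_feature(frames, identity_embed, identity_encoder)
print(f_i)  # [3.0, 4.0]
```

The same function applies to source and target videos; only the input features differ.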

3.1.3 Classification Head.

This branch of the network is a classifier $\mathcal{G}_{C}$, which trains the transformer encoder $\mathcal{G}_{e}$ to minimize the cross-entropy $\mathcal{L}_{cls}$ on the source domain data and the soft entropy $\mathcal{L}_{H}$ on the target data. Equations 3 and 4 define the cross-entropy loss and the soft-entropy loss, respectively:

$$\mathcal{L}_{cls}=-\frac{1}{n_{s}}\sum^{n_{s}}_{i=1}y_{i}\cdot\log\mathcal{G}_{C}(f_{i}^{s}) \tag{3}$$

$$\mathcal{L}_{H}=-\frac{1}{n_{t}}\sum^{n_{t}}_{i=1}\mathcal{G}_{C}(f_{i}^{t})\cdot\log\mathcal{G}_{C}(f_{i}^{t}) \tag{4}$$
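As an illustration of Equations 3 and 4, the following sketch computes both losses from class-probability vectors. It assumes that $\mathcal{G}_{C}$ outputs a probability distribution per video and that $y_{i}$ is one-hot, so the dot product in Equation 3 reduces to the log-probability of the true class. The function names and the toy values are ours.

```python
import math


def cross_entropy(probs, labels):
    """Eq. 3: mean negative log-probability of the true class over source videos."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(probs)


def soft_entropy(probs):
    """Eq. 4: mean Shannon entropy of the predictions over target videos."""
    total = 0.0
    for p in probs:
        total += -sum(q * math.log(q) for q in p if q > 0.0)
    return total / len(probs)


# Toy predictions (illustrative values only).
source_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
source_labels = [0, 1]
target_probs = [[0.25, 0.25, 0.25, 0.25]]

l_cls = cross_entropy(source_probs, source_labels)
l_h = soft_entropy(target_probs)  # uniform prediction: entropy equals log(4)
```

Minimizing the soft entropy pushes target predictions away from the uniform distribution, which is the maximum-entropy case shown above.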

3.1.4 Adaptation Head.

This branch of the network is a simple MLP preceded by the Gradient Reversal Layer (GRL). It trains the discriminator $\mathcal{G}_{D}$ to identify whether the video is from the source or the target domain and, at the same time, trains the encoder $\mathcal{G}_{e}$ to confuse the discriminator $\mathcal{G}_{D}$, which results in a min-max game. The adversarial loss $\mathcal{L}_{adv}$ is defined in Equation 5, where $\mathcal{L}_{b}$ denotes the binary cross-entropy loss and $d$ is the domain label of the video:

$$\mathcal{L}_{adv}=-\frac{1}{n}\sum_{i}^{n}\mathcal{L}_{b}(\mathcal{G}_{D}(\mathcal{G}_{grl}(f_{i})),d) \tag{5}$$
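The two ingredients of the adaptation head can be sketched in plain Python as follows. This is a simplified illustration under our own assumptions: the `GradientReversal` class mimics the forward and backward behavior of a GRL without any autograd framework, `domain_loss` is the standard mean binary cross-entropy that the discriminator minimizes (the sign convention of Equation 5 is not reproduced), and the value of `lam` is arbitrary.

```python
import math


class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        # The encoder receives the negated gradient, so it learns to confuse G_D.
        return [-self.lam * g for g in grad_output]


def binary_cross_entropy(p, d):
    """L_b for one video: p is the predicted probability of 'source', d is 1 (source) or 0 (target)."""
    return -(d * math.log(p) + (1 - d) * math.log(1.0 - p))


def domain_loss(probs, domains):
    """Mean L_b over a mixed batch of source and target videos."""
    return sum(binary_cross_entropy(p, d) for p, d in zip(probs, domains)) / len(probs)


grl = GradientReversal(lam=0.5)          # lam is a hypothetical value
features = [0.3, -1.2]
passed = grl.forward(features)           # unchanged features
reversed_grad = grl.backward([1.0, -2.0])

loss = domain_loss([0.9, 0.2], [1, 0])   # discriminator is confident and correct
```

In an autograd framework, the same behavior is usually obtained with a custom function whose backward pass negates and scales the incoming gradient.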

3.2 DTAB: Domain Transferable-guided Attention Block

In this section, we describe our Domain Transferable-guided Attention Block (DTAB), which uses transferable attention to calculate a weight representing the transferability of each patch, considering the spatio-temporal relation dynamics from the video data.

3.2.1 MDTA: Multi-head Domain Transferable-guided Attention.

Before presenting our proposed method, we review the self-attention mechanism Vaswani et al. (2017), which captures long-range dependencies. The mechanism computes these dependencies through the dot products between a set of query vectors ($\mathbf{Q}$) and a set of key vectors ($\mathbf{K}$), and uses the result to weight the value vectors ($\mathbf{V}$), as shown in Equation 6.

$$SA(Q,K,V)=softmax\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{6}$$
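For reference, Equation 6 can be written as a short, dependency-free Python sketch. The helper names are ours, matrices are plain lists of rows, and the toy inputs are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(q, k, v):
    """SA(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with d the key dimension."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = self_attention(q, k, v)  # each row of weights sums to 1
```

Each output row is a convex combination of the value vectors, weighted by the similarity between the corresponding query and every key.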

Figure 2 illustrates our proposed attention mechanism. It uses a domain discriminator that classifies every patch as belonging to the source or the target domain, and the error of this discrimination yields a weight that measures the transferability of each patch. The dot product between the discrimination errors produces a transferability metric for every pair of frames, bringing in temporal information. In addition, integrating the method into the multi-head mechanism accounts for the temporal relation between different spatial representations of the frames.

Equation 7 defines the Domain Transferable-guided Attention (DTA), which takes the dot product between the discrimination errors of the $Q$ and $K$ vectors. The result is a transferability matrix that indicates which patches or frames are more or less transferable than others, considering the long-term temporal relation between the frames. In other words, if the Domain Discriminator Error (DDE) of a patch goes to one, the patch is more likely to confuse the discriminator because its features are hard to discriminate, so it should carry more weight when classifying the video action. For convenience, we define $W^{Q}_{i}$, $W^{K}_{i}$, and $W^{V}_{i}$ as the projections of the $i$-th head, $W^{O}$ as the projection of the concatenated heads, and $d_{h}=\frac{d_{model}}{h}$. Equation 8 defines the Multi-head Domain Transferable-guided Attention (MDTA), which incorporates the information from different token subspace representations. The purpose of the Gradient Reversal Layer (GRL) is to prevent the discriminator from overfitting due to back-propagation from the classification head.

$$DTA_{i}(Q,K,V)=softmax\left(\frac{DDE(\mathcal{G}_{D^{\prime}}(QW^{Q}_{i}))\cdot DDE(\mathcal{G}_{D^{\prime}}(KW^{K}_{i}))^{T}}{\sqrt{d_{h}}}\right)VW^{V}_{i} \tag{7}$$

$$MDTA(Q,K,V)=concat(\{DTA_{i}\}_{i=1}^{h})W^{O} \tag{8}$$

Here, $\mathcal{G}_{D^{\prime}}$ denotes the domain discriminator, and the Domain Discriminator Error, $DDE$, is defined as a binary cross-entropy, as shown in Equation 9.

$$DDE(x)=\begin{cases}\log x,&\text{if } x \text{ is from the source}\\ \log(1-x),&\text{otherwise}\end{cases} \tag{9}$$

In short, integrating the multi-head mechanism with the new attention mechanism brings spatiotemporal information to give more weight to frames or patches that are likely to confuse the domain discriminator.
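A single DTA head (Equations 7 and 9) can be sketched as follows. This is only an illustration under several assumptions of ours: the discriminator $\mathcal{G}_{D^{\prime}}$ is taken to output one probability per token, so `p_q` and `p_k` stand for its outputs on the projected queries and keys; `v` stands for the already projected values $VW^{V}_{i}$; probabilities are clamped by a small `eps` to keep the logarithm finite; and all names and numbers are hypothetical.

```python
import math


def dde(p, from_source, eps=1e-7):
    """Eq. 9: log p for source tokens, log(1 - p) otherwise (p clamped for stability)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p) if from_source else math.log(1.0 - p)


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def dta_head(p_q, p_k, from_source, v):
    """One head of Eq. 7: the pairwise product of per-token errors replaces Q K^T."""
    e_q = [dde(p, from_source) for p in p_q]
    e_k = [dde(p, from_source) for p in p_k]
    d_h = len(v[0])
    # Transferability matrix: one score per (query token, key token) pair.
    weights = [softmax([eq * ek / math.sqrt(d_h) for ek in e_k]) for eq in e_q]
    out = [[sum(w * row[j] for w, row in zip(w_row, v)) for j in range(d_h)]
           for w_row in weights]
    return out, weights


# Toy discriminator outputs for k = 3 frame tokens of a source video.
p_tokens = [0.5, 0.9, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = dta_head(p_tokens, p_tokens, True, values)
```

Because the score for a pair of tokens depends only on their discrimination errors, the resulting matrix relates every frame to every other frame through transferability rather than feature similarity.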

3.2.2 Information Bottleneck.

The Information Bottleneck (IB) principle has been widely adopted in self-supervised frameworks Kim et al. (2020); Park et al. (2020) to learn better feature representations. Turrisi da Costa et al. da Costa et al. (2022) proposed a novel method that employs the IB principle to align the domain features through contrastive learning. This method establishes the loss function shown in Equation 10, which uses a cross-correlation matrix defined as $C_{i,j}=\frac{\sum^{B}_{b}z_{i,b}\cdot z^{\prime}_{j,b}}{\sqrt{\sum^{B}_{b}(z_{i,b})^{2}}\sqrt{\sum^{B}_{b}(z^{\prime}_{j,b})^{2}}}$, computed between pairs of source representations ${z}^{s}_{i}$ and target features ${z}^{t}_{j}$ from a batch of size $B$ over a total of $d$ features. In addition, Turrisi da Costa et al. da Costa et al. (2022) propose a queue that stores recent source features and uses them to increase the number of pairs.

$$\mathcal{L}_{ib}=\sum^{d}_{i=1}(1-C_{i,i})^{2}+\lambda\sum^{d}_{i=1}\sum^{d}_{j\neq i}(C_{i,j})^{2} \tag{10}$$
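The cross-correlation matrix and the loss in Equation 10 can be sketched in plain Python as below. The sketch assumes that both inputs are lists of $B$ samples with $d$ features each and that no feature column is all zeros; the queue of stored source features is omitted, and the value of `lam` is a hypothetical choice.

```python
import math


def cross_correlation(z, z_prime):
    """C[i][j]: correlation of feature i of z with feature j of z_prime, normalized over the batch."""
    d = len(z[0])
    norm_z = [math.sqrt(sum(row[i] ** 2 for row in z)) for i in range(d)]
    norm_zp = [math.sqrt(sum(row[j] ** 2 for row in z_prime)) for j in range(d)]
    return [[sum(a[i] * b[j] for a, b in zip(z, z_prime)) / (norm_z[i] * norm_zp[j])
             for j in range(d)]
            for i in range(d)]


def ib_loss(c, lam):
    """Eq. 10: pull the diagonal of C towards 1 and the off-diagonal entries towards 0."""
    d = len(c)
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lam * off_diag


z_source = [[1.0, 0.0], [0.0, 1.0]]   # B = 2 samples, d = 2 features
z_aligned = [[1.0, 0.0], [0.0, 1.0]]  # identical features: C is the identity
z_swapped = [[0.0, 1.0], [1.0, 0.0]]  # features permuted: C is anti-diagonal

loss_aligned = ib_loss(cross_correlation(z_source, z_aligned), lam=0.5)
loss_swapped = ib_loss(cross_correlation(z_source, z_swapped), lam=0.5)
print(loss_aligned, loss_swapped)  # 0.0 3.0
```

The loss vanishes when matching features are perfectly correlated across domains and all other feature pairs are decorrelated.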

3.2.3 Transformer Block.

Figure 3 shows our Domain Transferable-guided Attention Block (DTAB), which integrates our attention mechanism with the Queued Information Bottleneck da Costa et al. (2022). We maintain the transformer structure Vaswani et al. (2017), replacing the multi-head self-attention (MSA) with MDTA and adding the information bottleneck loss after the last residual connection. In our work, the DTAB attention mechanism weights the spatiotemporal relation transferability between the frames. At the same time, the IB da Costa et al. (2022) integrates a loss that minimizes the discrepancy between the residual features from MDTA.

4 Experiments and Results

To assess the effectiveness of our method, we conducted several studies on multiple video domain adaptation datasets, comparing against classical and state-of-the-art methods.

4.1 Datasets

We compare our proposed method with the state-of-the-art in three different benchmarks.

UCF $\leftrightarrow$ HMDB_full Chen et al. (2019) is one of the most widely used benchmarks. It contains a subset of videos from two public datasets, UCF101 Soomro et al. (2012) and HMDB51 Kuehne et al. (2011), totaling 3209 videos across 12 classes.

Kinetics $\rightarrow$ Gameplay Chen et al. (2019) is a non-public dataset that contains a subset of videos from the well-known Kinetics-400 Kay et al. (2017) and a private gameplay dataset Chen et al. (2019). The dataset contains 49998 videos and 30 classes.

Kinetics $\rightarrow$ NEC-Drone Choi et al. (2020) is a public dataset that contains videos from Kinetics-600 Carreira et al. (2018) and NEC-Drone. The dataset contains 10118 videos and 7 classes. For a fair comparison, we used the cropped version of NEC-Drone da Costa et al. (2022) that focused on the actors in the action.

4.2 Experimental Setup

To extract features from the videos, we studied three different pretrained backbones: ResNet101 He et al. (2016), I3D Carreira and Zisserman (2017), and the STAM Transformer Sharir et al. (2021). Our method depends on frame-level features, so for the 3D backbones, like STAM Sharir et al. (2021) and I3D Carreira and Zisserman (2017), we densely slide a temporal window of 16 frames along the video: for a frame $k$, the window uses frames from $k-7$ to $k+8$, with zero-padding at the beginning and the end of the video.
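The dense sliding window can be illustrated with the following sketch. It assumes 0-based frame indices and represents each frame as a flat list of numbers; the function names and the toy video are ours.

```python
def window_indices(k, num_frames, before=7, after=8):
    """Indices k-7 .. k+8 (16 positions); None marks a position that is zero-padded."""
    return [i if 0 <= i < num_frames else None
            for i in range(k - before, k + after + 1)]


def build_window(frames, k):
    """Gather the 16-frame clip centered around frame k, padding with zero frames."""
    zero_frame = [0.0] * len(frames[0])
    return [frames[i] if i is not None else zero_frame
            for i in window_indices(k, len(frames))]


video = [[float(i)] for i in range(20)]  # toy video: 20 frames, one value per frame

first = window_indices(0, len(video))    # 7 padded positions, then frames 0..8
last = window_indices(19, len(video))    # frames 12..19, then 8 padded positions
clip = build_window(video, 0)            # 16 frames fed to the 3D backbone
```

Every frame therefore yields exactly one 16-frame clip, so the 3D backbones produce one feature vector per frame, as the frame-level encoder requires.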

Our transformer encoder comprises $4$ blocks, the last of which is our Domain Transferable-guided Attention Block (DTAB). Every transformer block has $h=8$ attention heads and a hidden size of $d_{model}=512$. We trained with the ADAM optimizer using a weight decay of $5\cdot 10^{-4}$ and a learning rate of $3\cdot 10^{-5}$ for $300$ epochs. The sampling strategy divides each action video into $k$ segments and then randomly samples one frame from each segment. We present the adversarial training, DTAB hyperparameters, and sampling parameters in the supplementary material. The implementation is based on the PyTorch framework, and the source code is released. For the sake of reproducibility, we also release the extracted features of the public datasets for each backbone.
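The segment-based sampling strategy can be sketched as follows. The exact segment boundaries are not specified in the text, so the integer split used here is an assumption, as are the function name and the example values (100 frames, 16 segments).

```python
import random


def sample_frame_indices(num_frames, k, rng=random):
    """Split the video into k contiguous segments and draw one random frame from each."""
    if num_frames < k:
        raise ValueError("the video needs at least k frames")
    indices = []
    for s in range(k):
        start = s * num_frames // k        # first frame of segment s
        end = (s + 1) * num_frames // k    # one past the last frame of segment s
        indices.append(rng.randrange(start, end))
    return indices


rng = random.Random(0)                     # seeded only to make the example repeatable
sampled = sample_frame_indices(100, 16, rng)
```

Since each index comes from a different segment, the sampled frames are in temporal order and cover the whole video, while the random draw varies the exact frames between epochs.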

4.3 Comparisons to the State-of-the-art Methods

In this section, we evaluate our method in three different video domain adaptation datasets and compare them with different methods from the literature.

4.3.1 Results on UCF101 $\leftrightarrow$ HMDB51_full.

As shown in Table 1, we compare our results using different backbones. The first one is ResNet101, a 2D backbone, with which our method achieves an average increase of $2.4\%$ over the previous best result, reaching $88.2\%$. This accuracy surpasses some methods that use the I3D backbone, showing that even with a 2D backbone, our method can extract significant spatiotemporal information.

Looking at methods that use an I3D backbone, our method surpasses even works that use multi-modal data (RGB and Flow). Compared with the best single-modal method, our method achieves an average increase of $3.5\%$, and compared with the best multi-modal method, an average increase of $0.4\%$. Finally, for the STAM Transformer, Table 1 shows that our method achieves an average increase of $1.1\%$.

Table 1: Classification Accuracy on UCF101 $\leftrightarrow$ HMDB51_full. Multi-modal methods are represented with (R + F).

Method	Backbone	U $\rightarrow$ H	H $\rightarrow$ U	Average
Source Only	ResNet101	73.9	71.7	72.8
TA³N Chen et al. (2019)	ResNet101	78.3	81.8	80.1
MA²LT-D Chen et al. (2022)	ResNet101	85.0	86.6	85.8
TransferAttn (ours)	ResNet101	88.1	88.3	88.2
Source Only	TRN	82.2	88.1	85.2
STHC Li et al. (2023)	TRN	90.9	92.1	91.5
TranSVAE Wei et al. (2023)	TRN	92.2	96.5	94.3
TransferAttn (ours)	TRN	93.5	97.1	95.3
Source Only	I3D	80.6	89.3	85.0
STCDA (R + F) Song et al. (2021)	I3D	83.1	92.1	87.7
CycDA Lin et al. (2022)	I3D	88.1	90.0	89.1
MA²LT-D Chen et al. (2022)	I3D	89.3	91.2	90.3
CoMix Sahoo et al. (2021)	I3D	86.7	93.9	90.3
CO²A Turrisi da Costa et al. (2022)	I3D	87.8	95.8	91.8
CIA (R + F) Yang et al. (2022)	I3D	91.9	94.6	93.3
TranSVAE Wei et al. (2023)	I3D	87.8	99.0	93.4
MTRAN (R + F) Huang et al. (2022)	I3D	92.2	95.3	93.8
CleanAdapt (R + F) Dasgupta et al. (2023)	I3D	93.6	99.3	96.5
TransferAttn (ours)	I3D	94.4	99.4	96.9
Source Only	STAM	86.9	93.7	90.3
TranSVAE Wei et al. (2023)	STAM	93.5	99.5	96.5
UDAVT da Costa et al. (2022)	STAM	92.3	96.8	94.6
MA²LT-D Chen et al. (2022)	STAM	95.3	99.4	97.4
TransferAttn (ours)	STAM	97.2	99.7	98.5

4.3.2 Results on Kinetics $\rightarrow$ Gameplay.

We evaluate our method on the Kinetics $\rightarrow$ Gameplay task, as shown in Table 2. Because this dataset does not give access to the raw videos, only to the frame-level features from ResNet101, we could not provide results for methods whose backbone is not fixed during training or that rely on video-level features. As shown in Table 2, our method achieves an increase of $5.5\%$ in accuracy compared with the state-of-the-art MA²LT-D and is better than all prior methods.

Table 2: Classification Accuracy on Kinetics $\rightarrow$ Gameplay dataset.

Method	Backbone	K $\rightarrow$ G
Source Only	ResNet101	17.6
TA³N Chen et al. (2019)	ResNet101	27.5
MA²LT-D Chen et al. (2022)	ResNet101	31.5
TranSVAE Wei et al. (2023)	ResNet101	21.9
TransferAttn (ours)	ResNet101	37.0

4.3.3 Results on Kinetics $\rightarrow$ NEC-Drone.

Our method was also evaluated on the Kinetics $\rightarrow$ NEC-Drone task. As shown in Table 3, it achieves an increase of $9.5\%$ in comparison with UDAVT, establishing a new state-of-the-art result.

Table 3: Classification Accuracy on Kinetics $\rightarrow$ NEC-Drone dataset.

Method	Backbone	K $\rightarrow$ N
Source Only	STAM	29.4
MA²LT-D Chen et al. (2022)	STAM	55.4
TranSVAE Wei et al. (2023)	STAM	55.9
UDAVT da Costa et al. (2022)	STAM	65.3
TransferAttn (ours)	STAM	74.8

#	Components	K $\rightarrow$ N
a)	Standard Transformer	45.5
b)	+ MDTA	54.8
c)	+ IB	58.3
d)	DTAB (MDTA + IB)	74.8

4.4 Ablation Study

This subsection provides an ablation of the main choices of our method and of the effect of each loss function. We also show how our new transformer block (DTAB) can improve the results of other transformer-based UDA methods for action recognition and image classification.

4.4.1 Effect of DTAB components.

To understand the individual contribution of each component of our Domain Transferable-guided Attention Block (DTAB) to domain transferability, we evaluated how each component of the DTAB impacts the final result. Figure 4 summarizes the results, showing the accuracy on the Kinetics $\rightarrow$ NEC-Drone dataset and the t-SNE visualization.

Changing the Self-Attention (MSA) mechanism to our Domain Transferable Attention (MDTA) in the last transformer block increases the accuracy by $9.3\%$, whereas using only the IB mechanism in the last transformer block, together with the self-attention mechanism, increases the accuracy by $12.8\%$. Finally, integrating all the components, we achieve an increase of $29.3\%$, indicating the significance of both components in reducing the domain gap. The t-SNE plot also shows how each component impacts the clustering of the low-dimensional features from different classes: with all the components combined, features from the same class are better clustered, while different classes are farther apart.

4.4.2 Effect of integrating each loss component.

We conduct an ablation study by integrating each loss function, $\mathcal{L}^{s}_{cls}$, $\mathcal{L}^{t}_{H}$, $\mathcal{L}_{adv}$, and $\mathcal{L}_{ib}$. As seen in Table 4, adding a new component gradually improves the domain alignment, reducing the domain gap, and the best result is achieved when we use all components.

Table 4: Loss integration studies with accuracy on Kinetics $\rightarrow$ NEC-Drone dataset.

$\mathcal{L}^{s}_{cls}$	$\mathcal{L}^{t}_{H}$	$\mathcal{L}_{adv}$	$\mathcal{L}_{ib}$	K $\rightarrow$ N
✓				21.6
✓	✓			27.9
✓	✓	✓		54.8
✓	✓		✓	64.9
✓	✓	✓	✓	74.8

4.4.3 Effects of DTAB on other transformer VUDA methods.

Another significant ablation is the effect of our new transformer block, DTAB, on other transformer-based video UDA (VUDA) methods. For a fair comparison, we also studied the effect of the TVT Yang et al. (2023) attention mechanism (TAM). For this study, we selected UDAVT and the task UCF $\rightarrow$ HMDB_full, and we report the results obtained by our reproduction using the authors' code (https://github.com/vturrisi/UDAVT). The results in Table 5 show that adding our new transformer block increased the average accuracy by $1.7\%$, while the TVT Yang et al. (2023) attention mechanism (TAM) yielded only a slight improvement of $0.3\%$. This result shows that our DTAB module can be integrated with other transformer methods in action recognition domain adaptation, helps reduce the domain gap, and deals better with spatiotemporal information than the TVT mechanism. Due to its source-free characteristics, we could not evaluate DTAB on MTRAN Huang et al. (2022) in this ablation.

Table 5: Classification accuracy in HMDB-UCF_full dataset integrating DTAB on state-of-the-art transformer video unsupervised domain adaptation architectures

Method	Backbone	U $\rightarrow$ H	H $\rightarrow$ U	Average
UDAVT (Our implementation) da Costa et al. (2022)	STAM	92.2	96.5	94.4
UDAVT + TAM Yang et al. (2023)	STAM	92.5	96.9	94.7
UDAVT + DTAB	STAM	94.2	97.9	96.1

4.4.4 Effects of DTAB on image transformer UDA methods.

We also conducted experiments adapting the TVT and CDTrans methods to use our DTAB module. The results reported in Table 6 were obtained through our reproduction using the authors' code (https://github.com/uta-smile/TVT/ and https://github.com/CDTrans/CDTrans/) on the Office-31 dataset. The results show an increase of $1.1$ percentage points in the average accuracy of TVT Yang et al. (2023) and an increase of $0.8$ percentage points for CDTrans Xu et al. (2021).

Table 6: Classification accuracy on the Office-31 dataset when integrating DTAB into state-of-the-art transformer-based image unsupervised domain adaptation architectures

Method	Backbone	A $\rightarrow$ W	D $\rightarrow$ W	A $\rightarrow$ D	D $\rightarrow$ A	W $\rightarrow$ A	Avg.
TVT (Our impl.) Yang et al. (2023)	ViT-B 16	93.6	98.7	93.7	79.4	78.7	88.8
TVT + DTAB	ViT-B 16	94.7	99.9	95.0	80.3	79.5	89.9
CDTrans (Our impl.) Xu et al. (2021)	DeiT-S	93.5	98.2	94.0	77.7	77.0	88.1
CDTrans + DTAB	DeiT-S	94.7	98.4	95.6	77.7	78.1	88.9

4.4.5 Complexity Analysis.

We also conduct a complexity analysis of our TransferAttn architecture, comparing it with baseline methods in terms of the number of trainable parameters (#Parameters) and floating-point operations (GFLOPs). We included the I3D backbone FLOPs for all reported methods except UDAVT, which strictly uses STAM as its backbone. As shown in Table 7, our method has slightly more trainable parameters than TranSVAE (16.3 M versus 12.7 M), which is compensated by the significant improvement in domain adaptation reported in Section 4.3. Compared with other transformer methods, like UDAVT, our method has substantially fewer trainable parameters (16.3 M versus 118.9 M) and floating-point operations (55.3 versus 524.7 GFLOPs) while achieving a significant improvement in domain adaptation, indicating the efficiency of our transformer architecture.

Table 7: Comparison on Model Complexity

Methods	$\#$ Parameters	GFLOPs
CoMix	30.4 M	37.1
TranSVAE	12.7 M	36.5
MA$^{2}$LT-D	49.7 M	51.2
UDAVT	118.9 M	524.7
TransferAttn (ours)	16.3 M	55.3
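
To make the comparison in Table 7 concrete, the sketch below expresses each method's cost relative to TransferAttn. The numbers are copied from the table; the helper name `relative_cost` and the plain-text method keys are hypothetical.

```python
# (#parameters in millions, GFLOPs) copied from Table 7.
TABLE7 = {
    "CoMix": (30.4, 37.1),
    "TranSVAE": (12.7, 36.5),
    "MA2LT-D": (49.7, 51.2),
    "UDAVT": (118.9, 524.7),
    "TransferAttn": (16.3, 55.3),
}


def relative_cost(table, reference="TransferAttn"):
    """Return (parameter ratio, GFLOPs ratio) of each method to the reference."""
    ref_params, ref_flops = table[reference]
    return {
        name: (round(params / ref_params, 2), round(flops / ref_flops, 2))
        for name, (params, flops) in table.items()
    }


if __name__ == "__main__":
    for name, (p_ratio, f_ratio) in relative_cost(TABLE7).items():
        print(f"{name}: {p_ratio}x parameters, {f_ratio}x GFLOPs")
```

Under these numbers, UDAVT uses roughly 7.3 times the parameters and 9.5 times the GFLOPs of TransferAttn, whereas TranSVAE uses about 0.78 times the parameters.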

5 Conclusion and Limitations

Unsupervised Domain Adaptation (UDA) has been widely explored for images. However, there are still few works for videos, since aligning temporal features can be challenging. Even fewer works exploit transformer architectures for video UDA, a promising strategy given their performance in other tasks. To address this, we proposed Transferable-guided Attention (TransferAttn), a framework for video UDA and one of the few works that exploit transformer architectures to adapt cross-domain knowledge. We also proposed a novel Domain Transferable-guided Attention Block (DTAB) that employs attention mechanisms to encourage spatio-temporal transferability between frames. We evaluated our method on three benchmarks: UCF $\leftrightarrow$ HMDB_full, Kinetics $\rightarrow$ Gameplay, and Kinetics $\rightarrow$ NEC-Drone. It outperformed all the state-of-the-art methods compared, showing the effectiveness of the proposed strategy. The DTAB block also proved to be a promising strategy by itself: when added to other state-of-the-art UDA frameworks, it increased their performance.

We now turn to the limitations of the TransferAttn method and directions for future work. In future work, we intend to evaluate our framework on other video tasks, like video segmentation and action localization. Furthermore, we plan to extend our method to use multi-modal data and to make it compatible with source-free methods. Lastly, we intend to explore newer video transformer backbones.

References

da Costa et al. [2022] V. da Costa, G. Zara, P. Rota, T. Oliveira-Santos, N. Sebe, V. Murino, and E. Ricci. Unsupervised domain adaptation for video transformers in action recognition. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 1258–1265, Los Alamitos, CA, USA, aug 2022. IEEE Computer Society. doi:10.1109/ICPR56361.2022.9956679. URL https://doi.ieeecomputersociety.org/10.1109/ICPR56361.2022.9956679.
Kong and Fu [2022] Yu Kong and Yun Fu. Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5):1366–1401, 2022.
Huang et al. [2018] D-A. Huang, V. Ramanathan, D. Mahajan, L. Torresani, M. Paluri, L. Fei-Fei, and J. C. Niebles. What makes a video a video: Analyzing temporal information in video understanding models and datasets. In CVPR, pages 7366–7375, Salt Lake City, UT, USA, June 18–22 2018. IEEE Computer Society.
Wang et al. [2017] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2017. doi:10.1109/TCSVT.2016.2589879.
Chen et al. [2022] Peipeng Chen, Yuan Gao, and Andy J. Ma. Multi-level attentive adversarial learning with temporal dilation for unsupervised video domain adaptation. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 776–785, 2022. doi:10.1109/WACV51458.2022.00085.
Ganin et al. [2017] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-Adversarial Training of Neural Networks, pages 189–209. Springer International Publishing, Cham, 2017. ISBN 978-3-319-58347-1. doi:10.1007/978-3-319-58347-1_10. URL https://doi.org/10.1007/978-3-319-58347-1_10.
Tzeng et al. [2017] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2962–2971, 2017. doi:10.1109/CVPR.2017.316.
Ganin and Lempitsky [2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 1180–1189. JMLR.org, 2015.
Ghifary et al. [2014] Muhammad Ghifary, W. Bastiaan Kleijn, and Mengjie Zhang. Domain adaptive neural networks for object recognition. In Duc-Nghia Pham and Seong-Bae Park, editors, PRICAI 2014: Trends in Artificial Intelligence, pages 898–904, Cham, 2014. Springer International Publishing. ISBN 978-3-319-13560-1.
Long et al. [2017] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 2208–2217. JMLR.org, 2017.
Xu et al. [2021] Tongkun Xu, Weihua Chen, Pichao Wang, Fan Wang, Hao Li, and Rong Jin. Cdtrans: Cross-domain transformer for unsupervised domain adaptation. CoRR, abs/2109.06165, 2021. URL https://arxiv.org/abs/2109.06165.
Yang et al. [2023] Jinyu Yang, Jingjing Liu, Ning Xu, and Junzhou Huang. Tvt: Transferable vision transformer for unsupervised domain adaptation. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 520–530, 2023. doi:10.1109/WACV56688.2023.00059.
Munro and Damen [2020] Jonathan Munro and Dima Damen. Multi-modal Domain Adaptation for Fine-grained Action Recognition. In Computer Vision and Pattern Recognition (CVPR), 2020.
Chen et al. [2019] Min-Hung Chen, Zsolt Kira, Ghassan Alregib, Jaekwon Yoo, Ruxin Chen, and Jian Zheng. Temporal attentive alignment for large-scale video domain adaptation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6320–6329, 2019. doi:10.1109/ICCV.2019.00642.
Yin et al. [2022] Yuehao Yin, Bin Zhu, Jingjing Chen, Lechao Cheng, and Yu-Gang Jiang. Mix-dann and dynamic-modal-distillation for video domain adaptation. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, page 3224–3233, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392037. doi:10.1145/3503161.3548313. URL https://doi.org/10.1145/3503161.3548313.
Wei et al. [2023] Pengfei Wei, Lingdong Kong, Xinghua Qu, Yi Ren, Zhiqiang Xu, Jing Jiang, and Xiang Yin. Unsupervised video domain adaptation for action recognition: A disentanglement perspective. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Rp4PA0ez0m.
Turrisi da Costa et al. [2022] Victor G. Turrisi da Costa, Giacomo Zara, Paolo Rota, Thiago Oliveira-Santos, Nicu Sebe, Vittorio Murino, and Elisa Ricci. Dual-head contrastive domain adaptation for video action recognition. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2234–2243, 2022. doi:10.1109/WACV51458.2022.00229.
Dasgupta et al. [2023] Avijit Dasgupta, C.V. Jawahar, and Karteek Alahari. Overcoming label noise for source-free unsupervised video domain adaptation. In Proceedings of the Thirteenth Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP ’22, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450398220. doi:10.1145/3571600.3571621. URL https://doi.org/10.1145/3571600.3571621.
Huang et al. [2022] Yi Huang, Xiaoshan Yang, Ji Zhang, and Changsheng Xu. Relative alignment network for source-free multimodal video domain adaptation. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, page 1652–1660, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392037. doi:10.1145/3503161.3548009. URL https://doi.org/10.1145/3503161.3548009.
Li et al. [2023] K. Li, D. Patel, E. Kruus, and M. Min. Source-free video domain adaptation with spatial-temporal-historical consistency learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14643–14652, Los Alamitos, CA, USA, jun 2023. IEEE Computer Society. doi:10.1109/CVPR52729.2023.01407. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52729.2023.01407.
Choi et al. [2020] Jinwoo Choi, Gaurav Sharma, Manmohan Chandraker, and Jia-Bin Huang. Unsupervised and semi-supervised domain adaptation for action recognition from drones. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1706–1715, 2020. doi:10.1109/WACV45572.2020.9093511.
Ji et al. [2013] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013. doi:10.1109/TPAMI.2012.59.
Deng et al. [2023] Andong Deng, Taojiannan Yang, and Chen Chen. A large-scale study of spatiotemporal representation learning with a new benchmark on action recognition. arXiv preprint arXiv:2303.13505, 2023.
Tran et al. [2015] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, Los Alamitos, CA, USA, dec 2015. IEEE Computer Society. doi:10.1109/ICCV.2015.510. URL https://doi.ieeecomputersociety.org/10.1109/ICCV.2015.510.
Carreira and Zisserman [2017] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
Wang et al. [2018] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence, 41(11):2740–2755, 2018.
Wang et al. [2021] Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. Tdn: Temporal difference networks for efficient action recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1895–1904, 2021.
Donahue et al. [2015] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
Yue-Hei Ng et al. [2015] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.
Wu et al. [2015] Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, and Xiangyang Xue. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In ACM international conference on Multimedia, pages 461–470, 2015.
Ke et al. [2017] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. A new representation of skeleton sequences for 3d action recognition. In IEEE conference on computer vision and pattern recognition, pages 3288–3297, 2017.
Shahroudy et al. [2016] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016.
Zhu et al. [2016] Wentao Zhu, Cuiling Lan, Junliang Xing, Wenjun Zeng, Yanghao Li, Li Shen, and Xiaohui Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In AAAI conference on artificial intelligence, volume 30, 2016.
Liu et al. [2016] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 816–833. Springer, 2016.
Yan et al. [2018] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI conference on artificial intelligence, volume 32, 2018.
Si et al. [2019] Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, and Tieniu Tan. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In IEEE/CVF conference on computer vision and pattern recognition, pages 1227–1236, 2019.
Kim et al. [2024] Kiyoon Kim, Shreyank N Gowda, Panagiotis Eustratiadis, Antreas Antoniou, and Robert B Fisher. Adversarial augmentation training makes action recognition models more robust to realistic video distribution shifts, 2024.
Lai et al. [2024] Zhengfeng Lai, Haoping Bai, Haotian Zhang, Xianzhi Du, Jiulong Shan, Yinfei Yang, Chen-Nee Chuah, and Meng Cao. Empowering unsupervised domain adaptation with large-scale pre-trained vision-language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2691–2701, 2024.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
Sharir et al. [2021] Gilad Sharir, Asaf Noy, and Lihi Zelnik-Manor. An image is worth 16x16 words, what is a video worth?, 2021.
Arnab et al. [2021] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In International Conference on Computer Vision (ICCV), 2021.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Kim et al. [2020] Donghyun Kim, Kuniaki Saito, Tae-Hyun Oh, Bryan A. Plummer, Stan Sclaroff, and Kate Saenko. Cross-domain Self-supervised Learning for Domain Adaptation with Few Source Labels. arXiv e-prints, art. arXiv:2003.08264, March 2020. doi:10.48550/arXiv.2003.08264.
Park et al. [2020] Changhwa Park, Jonghyun Lee, Jaeyoon Yoo, Minhoe Hur, and Sungroh Yoon. Joint Contrastive Learning for Unsupervised Domain Adaptation. arXiv e-prints, art. arXiv:2006.10297, June 2020. doi:10.48550/arXiv.2006.10297.
Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012. URL http://arxiv.org/abs/1212.0402.
Kuehne et al. [2011] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563, 2011. doi:10.1109/ICCV.2011.6126543.
Kay et al. [2017] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017.
Carreira et al. [2018] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600, 2018.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
Song et al. [2021] Xiaolin Song, Sicheng Zhao, Jingyu Yang, Huanjing Yue, Pengfei Xu, Runbo Hu, and Hua Chai. Spatio-temporal contrastive domain adaptation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9787–9795, June 2021.
Lin et al. [2022] Wei Lin, Anna Kukleva, Kunyang Sun, Horst Possegger, Hilde Kuehne, and Horst Bischof. Cycda: Unsupervised cycle domain adaptation to learn from image to video. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 698–715, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-20062-5.
Sahoo et al. [2021] Aadarsh Sahoo, Rutav Shah, Rameswar Panda, Kate Saenko, and Abir Das. Contrast and mix: Temporal contrastive video domain adaptation with background mixing. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 23386–23400, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/c47e93742387750baba2e238558fa12d-Abstract.html.
Yang et al. [2022] Lijin Yang, Yifei Huang, Yusuke Sugano, and Yoichi Sato. Interact before align: Leveraging cross-modal knowledge for domain adaptive action recognition. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14702–14712, 2022. doi:10.1109/CVPR52688.2022.01431.

Appendix A Experiment Details

Adversarial training requires the parameter $\lambda$, which controls the weight of the Gradient Reversal Layer (GRL) on the adversarial head. Our DTAB module requires the parameters $Q$ and $\alpha$, where the first controls the queue size and the second controls the weight of the IB loss da Costa et al. [2022]. For the GRL inside DTAB, we fixed the weight to $\lambda_{DTAB}=1$ for every adaptation task. We must also define the batch size and the number $k$ of sampled frames, which are related to the training schedule.
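
For reference, the Gradient Reversal Layer of Ganin and Lempitsky [2015] acts as the identity in the forward pass and multiplies the incoming gradient by $-\lambda$ in the backward pass. Writing the layer as $R_{\lambda}$ (a symbol introduced here only for illustration) and the identity matrix as $I$, its behavior can be summarized as:

$$R_{\lambda}(x) = x, \qquad \frac{\partial R_{\lambda}(x)}{\partial x} = -\lambda I.$$

A larger $\lambda$ therefore strengthens the adversarial signal reaching the feature encoder, which is consistent with smaller values being chosen below for tasks where training needed to be more stable.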

In the UCF $\rightarrow$ HMDB adaptation task Chen et al. [2019], we used a batch size of $32$ and sampled $k=53$ frames. Due to the smaller dataset, we used a queue size of $Q=1024$. We also set the IB loss weight to $\alpha=0.001$ and the adversarial weight to $\lambda=1$. For the HMDB $\rightarrow$ UCF task, the only change is the adversarial weight, which we set to $\lambda=0.5$ to make the training more stable.

In the Kinetics $\rightarrow$ Gameplay adaptation task Chen et al. [2019], we used a batch size of $64$ and sampled $k=23$ frames. In this task, we reduced the queue size to $Q=512$, set the IB loss weight to $\alpha=0.001$, and used a smaller adversarial weight of $\lambda=0.05$ to make the training more stable.

In the Kinetics $\rightarrow$ NEC-Drone task Choi et al. [2020], we used a batch size of $64$, sampled $k=53$ frames, and used a queue size of $Q=512$. The IB loss weight is $\alpha=0.025$ and the adversarial weight is $\lambda=0.5$.
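
The settings above can be collected into a single configuration table. The sketch below is one possible layout; the values are taken from the text of this appendix, while the task keys, field names, and the helper `get_hparams` are hypothetical and not taken from the authors' code.

```python
# Per-task hyperparameters as reported in Appendix A.
# Field names and task keys are hypothetical; values come from the text.
HPARAMS = {
    "ucf_to_hmdb": dict(batch_size=32, num_frames=53, queue_size=1024,
                        alpha_ib=0.001, lambda_adv=1.0),
    "hmdb_to_ucf": dict(batch_size=32, num_frames=53, queue_size=1024,
                        alpha_ib=0.001, lambda_adv=0.5),
    "kinetics_to_gameplay": dict(batch_size=64, num_frames=23, queue_size=512,
                                 alpha_ib=0.001, lambda_adv=0.05),
    "kinetics_to_necdrone": dict(batch_size=64, num_frames=53, queue_size=512,
                                 alpha_ib=0.025, lambda_adv=0.5),
}

# GRL weight inside DTAB, fixed for every adaptation task.
LAMBDA_DTAB = 1.0


def get_hparams(task):
    """Return a copy of the settings for one task, including the DTAB GRL weight."""
    if task not in HPARAMS:
        raise KeyError(f"unknown task: {task!r}")
    config = dict(HPARAMS[task])
    config["lambda_dtab"] = LAMBDA_DTAB
    return config


if __name__ == "__main__":
    print(get_hparams("kinetics_to_necdrone"))
```

Returning a copy keeps the shared table unchanged if a caller overrides a value for a single run.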

Appendix B More Ablation Studies

This section reports additional ablation studies conducted with the TransferAttn framework.

B.1 Effect of DTAB position.

To study the impact of the position of the DTAB module, we experimented with replacing all transformer blocks with DTAB, replacing only the first or only the last block, and replacing the blocks in even or odd positions. The results in Table 8 show that DTAB works best when it replaces only the last transformer block, where the patch features are more fine-grained than in the other blocks.

Table 8: Ablation study on Kinetics $\rightarrow$ NEC-Drone integrating the DTAB in different encoder positions.

DTAB Position	Backbone	K $\rightarrow$ N
All Blocks	STAM	36.0
First Only		38.2
Even Positions		54.0
Odd Positions		65.4
Last Only		74.8
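
The placement strategies compared in Table 8 can be expressed as a choice of block indices. The sketch below is illustrative only: it assumes 1-based block positions and a hypothetical encoder depth, neither of which is stated in this section, and the function name `dtab_positions` is not taken from the authors' code.

```python
def dtab_positions(num_blocks, strategy):
    """Return the 1-based positions of the encoder blocks replaced by DTAB.

    Assumption: positions are counted from 1, so "odd" includes the first block.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    positions = list(range(1, num_blocks + 1))
    if strategy == "all":
        return positions
    if strategy == "first":
        return [1]
    if strategy == "last":
        return [num_blocks]
    if strategy == "even":
        return [p for p in positions if p % 2 == 0]
    if strategy == "odd":
        return [p for p in positions if p % 2 == 1]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    # Hypothetical depth of 4 blocks, for illustration only.
    for strategy in ("all", "first", "even", "odd", "last"):
        print(strategy, dtab_positions(4, strategy))
```

The remaining positions would keep the standard transformer block, so "last" changes a single block regardless of the encoder depth.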