eMoE-Tracker: Environmental MoE-based Transformer for Robust Event-guided Object Tracking

Yucheng Chen1 and Lin Wang1,2,∗

∗Corresponding author.
1Yucheng Chen is with the AI Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangdong, China. [email protected]
1,2Lin Wang is with AI/CMA Thrust, HKUST(GZ) and Dept. of CSE, HKUST, Hong Kong SAR, China. Email: [email protected]
Abstract

The unique complementarity of frame-based and event cameras for high frame rate object tracking has recently inspired research on multi-modal fusion approaches. However, these methods directly fuse both modalities and thus ignore the environmental attributes, e.g., motion blur, illumination variance, occlusion, and scale variation. Meanwhile, the lack of interaction between search and template features makes it difficult to distinguish target objects from the background. As a result, performance degrades, especially in challenging conditions. This paper proposes a novel and effective Transformer-based event-guided tracking framework, called eMoE-Tracker, which achieves new SOTA performance under various conditions. Our key idea is to disentangle the environment into several learnable attributes to dynamically learn the attribute-specific features for better interaction and discriminability between the target information and background. To achieve this goal, we first propose an environmental Mixture-of-Experts (eMoE) module that is built upon environmental Attributes Disentanglement, which learns attribute-specific features, and environmental Attributes Gating, which dynamically assembles the attribute-specific features according to learnable attribute scores. The eMoE module is a subtle router that fine-tunes the transformer backbone more efficiently. We then introduce a contrastive relation modeling (CRM) module to improve interaction and discriminability between the target information and background. Extensive experiments on diverse event-based benchmark datasets showcase the superior performance of our eMoE-Tracker compared to prior art. Project page: https://vlislab22.github.io/eMoE-Tracker/

Abstract

Due to the limited space in the main paper, we provide additional material for the proposed method and experimental results. Sec. VI introduces the datasets. Then, Sec. VII gives more details about the implementation and experiments. Afterward, we report more visual results and performance evaluation under different attributes in Sec. VIII. Finally, Sec. IX summarizes the supplementary material.

Index Terms:
Event-guided object tracking, mixture-of-experts, contrastive learning.

I INTRODUCTION

Visual object tracking is a critical task with many applications, such as robot scene perception [1] and self-driving [2]. It involves tracking the target objects in the sequential video frames based on the initial frame. Many efforts have been made to develop tracking algorithms with standard RGB cameras; however, these methods often fail under challenging conditions, e.g., low light.

Event cameras [3] are bio-inspired sensors with the merits of high dynamic range and no motion blur, which are complementary to conventional RGB cameras. The potential value of the complementarity between the RGB frames and event streams can help improve the robustness of tracking in many challenging visual conditions, e.g., extreme illumination variance, motion blur, and occlusion.

Figure 1: An illustration of the core idea of the environmental MoE (eMoE) module. This module acts as a subtle router to fine-tune the frozen backbone encoder. The number of experts is determined by the attributes we decouple for the environmental conditions, and each expert is responsible for learning the attribute-specific features. All the learned features are assembled and added with the ones from the backbone encoder at the corresponding layer for robust representation of tracking.

This has inspired research endeavors in developing event-guided, i.e., RGB-event (RGB-E), multi-modal tracking approaches [4, 5, 6, 7, 8, 9, 10]. These works can be divided into two categories based on the network structure: two-stream, i.e., siamese network, trackers [5, 6, 7] and one-stream trackers [4, 9, 10]. The former use two identical branches to process the RGB and event modalities separately. To better leverage the complementarity and increase the interaction between them, complex fusion modules are designed, which increases model complexity. The latter are based on the vision transformer (ViT) structure [11], where RGB and event tokens are concatenated and fed into the ViT backbone for feature encoding. Although they are free from the complex network structure, they fail to consider the impact of environmental attributes on tracking performance in challenging conditions. Meanwhile, the lack of interaction between search and template features makes it difficult to distinguish target objects from the background. Consequently, performance degrades, especially in challenging conditions. We therefore aim to address a novel research question: how to design a one-stream framework that can distinguish the environmental attributes while enabling feature interaction for robust tracking under diverse visual conditions?

In this paper, we propose a novel one-stream framework with an environmental Mixture-of-Experts (eMoE) structure along with a contrastive relation modeling (CRM) module to achieve robust tracking in challenging conditions, as shown in Fig. 1. The key insight is to disentangle the environmental attributes through learnable layers to dynamically learn the attribute-specific features for better interaction between the target objects and background. Specifically, to disentangle the environmental attributes, the eMoE module (Sec. III-B) is proposed to achieve two goals: (i) environmental Attributes Disentanglement and (ii) environmental Attributes Gating. For the former, the eMoE module disentangles four attributes (illumination variance, motion blur, scale variance, and occlusion) to learn the attribute-specific features. This has been experimentally shown to be sufficient to reflect environmental effects on tracking given the advantages of event cameras (see Tab. IV). For the latter, the attribute-specific features are assembled, weighted by their attribute scores, to build a more discriminative representation for tracking under the corresponding visual conditions, e.g., motion blur. For more efficient training, our eMoE module can be inserted into arbitrary layers to fine-tune the ViT backbone encoder. In addition, the CRM module (Sec. III-B3) aims to better distinguish the target features from the background ones via contrastive learning. This improves the interaction between the target template and search region and enhances the target objects. By integrating the eMoE and CRM modules, the output features become more discriminative and less noisy, yielding more robust tracking performance under diverse visual conditions.

To summarize, the contributions of our paper are three-fold: (I) We explore a novel research direction that exploits environmental attributes to improve tracking robustness and precision. (II) We introduce the environmental Mixture-of-Experts (eMoE) module to disentangle the environment into several learnable attributes, learn attribute-specific features, and assemble them into a more discriminative representation for RGB-event tracking tasks. (III) A contrastive relation modeling (CRM) module is designed to further increase the interaction between the search region and target template, thus enhancing the target object information under challenging conditions.

II Related Work

Visual Object Tracking (VOT). The mainstream deep trackers can be roughly categorized into two types based on structure: trackers with two-stream networks and trackers with one-stream networks. Siamese-based trackers [12, 13, 14, 15, 16, 17, 18, 19] are the archetypal two-stream networks, which are designed with two symmetrical branches to learn a similarity function between target template images and the search region. On the other hand, trackers with one-stream networks [20, 21, 22, 23, 24] split the target template and search region into a set of tokens, concatenate them, and then feed them to a fully Transformer-based structure. Among them, MixFormer [20] introduces a set of mixed attention modules to extract and integrate the features of the target template and search region simultaneously and to obtain discriminative target-specific features. OSTrack [22] proposes an early elimination module in the ViT encoder to discriminate the background tokens from the search region. We utilize the ViT backbone model [22] to build a one-stream RGB-E tracker by disentangling the environmental attributes while enabling effective interaction between search and target.

RGB-E Tracking. Daniel et al. [25, 26] first tackled the problem of feature tracking using events and frames by developing a maximum likelihood approach based on a generative event model. DashNet [27] later built an RGB-E tracker by designing a complementary filter and an attention module. ESVM [28] incorporates event-based guiding methods into the support vector machine to improve tracking accuracy. Recently, Zhang et al. [6] introduced self- and cross-domain attention with an adaptive weighting mechanism to fuse frames and events. Tang et al. [4] proposed a one-stream one-stage RGB-E tracking framework to process feature extraction, fusion, matching, and interactive learning simultaneously. ViPT [10] exploits modal-relevant prompts to fine-tune the pre-trained backbone model and adapt it to multi-modal tracking tasks. However, these methods only improve the cross-modal fusion and fail to consider the impact of complex environmental attributes on the tracking performance. In contrast, our eMoE-Tracker disentangles the environment into several learnable attributes to dynamically learn the attribute-specific features for better interaction and discriminability between the target and background regions.

III Method

Figure 2: Overview of our proposed framework. The input of the whole network is the patch embeddings of RGB frames and stacked event frames. The concatenated two-modal patches are fed into the backbone model and eMoE, and eMoE is inserted at the $L$-th ViT layer to generate feature tokens, which are combined with the tokens from the ViT encoder at the corresponding layer. $E^{l}$ is the ViT encoder at layer $l$. The CRM module takes the enhanced tokens to further improve the discriminability of the target object.

III-A Problem Setting and Overview

III-A1 Problem Setting

Given an initial target bounding box $B_{0}$ in a video, the goal of an RGB-based tracker is to learn a tracking model $T_{RGB}: \{I_{RGB}, B_{0}\} \rightarrow B$ to estimate the bounding box in all subsequent frames $I_{RGB}$. In RGB-E tracking, event streams are introduced and stacked as event frames, extending the input to $(I_{RGB}, I_{E})$, where the subscript $E$ indicates events. For details of representations for event data, we refer readers to [3, 29]. Therefore, the RGB-E tracking model can be represented as $T_{RGB-E}: \{I_{RGB}, I_{E}, B_{0}\} \rightarrow B$. We choose the transformer encoder and decoder of [4] as our backbone encoder and decoder for better efficiency.
The archetypical structure of the backbone encoder and decoder can be represented as $\mathcal{F}_{E} \circ \mathcal{F}_{D}$, where $\mathcal{F}_{E}: \{I_{RGB}, I_{E}, B_{0}\} \rightarrow \mathcal{T}_{RGB-E}$ denotes the backbone encoder and $\mathcal{F}_{D}: \mathcal{T}_{RGB-E} \rightarrow B$ represents the decoder, which produces the estimated bounding box results. The main body of $\mathcal{F}_{E}$ here is a vanilla vision transformer [11] containing 12 encoder layers. Each layer contains Multi-head Self-Attention (MSA), LayerNorm (LN), a Feed-Forward Network (FFN), and residual connections.
Before the inputs are fed into the backbone network, RGB and event patches are projected into feature tokens, added to positional embeddings, and then concatenated into the RGB and event feature tokens $\mathcal{T}_{RGB-E}^{0} = [\mathcal{T}_{RGB}^{z}, \mathcal{T}_{RGB}^{x}, \mathcal{T}_{E}^{z}, \mathcal{T}_{E}^{x}]$, which serve as the inputs of the transformer encoder. Tokens through the $l$-th encoder layer $E^{l}$ can be represented as $\mathcal{T}_{RGB}^{l-1}$. The final layer encoder output is denoted as $\mathcal{T}_{RGB-E}^{L}$.
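
To make the token layout concrete, the following minimal sketch computes where each of the four token groups would sit in the concatenated sequence. It assumes that $z$ marks the target template and $x$ the search region (as in the naming of the fused tokens in Sec. III-C) and that patches are square and non-overlapping. The image sizes and patch size are hypothetical values chosen for illustration and are not taken from the paper.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def token_layout(template_size: int, search_size: int, patch_size: int) -> dict:
    """Index ranges (start, end) of each group in the concatenated
    sequence [T_RGB^z, T_RGB^x, T_E^z, T_E^x]."""
    n_z = num_patches(template_size, patch_size)  # template tokens per modality
    n_x = num_patches(search_size, patch_size)    # search tokens per modality
    layout = {}
    start = 0
    for name, length in (("rgb_z", n_z), ("rgb_x", n_x),
                         ("event_z", n_z), ("event_x", n_x)):
        layout[name] = (start, start + length)
        start += length
    layout["total"] = start
    return layout


# Hypothetical sizes (assumption): 128x128 template, 256x256 search, 16x16 patches.
layout = token_layout(template_size=128, search_size=256, patch_size=16)
print(layout)
```

Under these assumed sizes, each modality contributes 64 template tokens and 256 search tokens, so the encoder would receive 640 tokens in total.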

III-B The proposed eMoE-Tracker

III-B1 Overview

An overview of our eMoE-Tracker is shown in Fig. 2. The RGB and event inputs are first projected into a sequence of tokens and then fed into backbone encoders and eMoE. The eMoE module aims to achieve environmental attributes disentanglement and environmental attributes gating. The backbone encoder layers are frozen and the parameters are not updated. The eMoE module disentangles the environmental attributes to learn the attribute-specific features.

Outputs from the eMoE module can be dynamically added to the tokens from the corresponding layer of the ViT backbone. Overall, the process can be formulated as follows:

$$\mathcal{T}^{l} = \mathcal{T}_{RGB-E}^{l} + \mathcal{P}^{l+1}, \quad l = 1, 2, \ldots, L \tag{1}$$

where $\mathcal{P}^{l+1}$ denotes the token features from eMoE at layer $l+1$.

III-B2 eMoE Module

The eMoE module aims to achieve: i) environmental attributes disentanglement and ii) environmental attributes gating. We now describe them.

i. Environmental Attributes Disentanglement. As shown in Fig. 3, to enable better learning of the attribute-specific features under various challenging conditions, we manually annotate the visible-event datasets with four attribute labels: motion blur, illumination variance, scale variance, and occlusion. Then, a mixture-of-experts network with four identical branches is designed to learn the attribute-specific features for each challenging scenario. It allows us to capture more discriminative features and suppress the noise brought by other environmental attributes. All four expert networks employ the CONV-MLP-CONV structure but with different parameters. Specifically, considering the $l$-th ViT layer, we assume that there are $K$ experts $\{f_{expert}^{l,i}(\mathcal{T}^{l}): \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N \times D}, i \in [1, K]\}$ to learn the attribute-specific features under the corresponding environmental condition, where $l$ denotes the layer of the ViT backbone and $i$ represents the index of experts.
Through each expert, we generate a series of attribute-specific features $\{\mathcal{H}_{i}^{l} \in \mathbb{R}^{N \times D}, i \in [1, K]\}$ by the function $f_{expert}^{l,i}(\mathcal{T}^{l})$.

ii. Environmental Attribute Gating. After obtaining the attribute-specific features under different decoupled environmental conditions, we should consider the different contributions of these features with the supervision of the ground truth attribute labels $G = [G_{1}, G_{2}, \ldots, G_{K}]$. To achieve this goal, the gating network employs the CONV-BatchNorm-ReLU-CONV-Sigmoid structure with two loops and is fed with all the RGB and event patch tokens to generate $K$ attribute scores for the attribute-specific features, where $K$ denotes the number of experts. The learnable score indicates the main challenge types in the corresponding scenario; therefore, the attribute-specific feature with a larger score should contribute more to the assembled features to achieve a robust representation under various challenging conditions. Moreover, it can suppress the noise from other environmental attributes. Specifically, at the $l$-th layer of the backbone, the attribute scores $W^{l,t}$ are generated from the gating network $\{f_{g}^{l,t}(\mathcal{T}^{l}): \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N \times K}\}$, where $t$ represents the index of experts.
The assembled feature $\mathcal{F}_{assemble}^{l}$ at layer $l$ can be formally calculated as $\sum_{t=1}^{K} W^{l,t} \mathcal{H}_{t}^{l}$.
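
The disentangle-then-gate computation can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the experts and the gate below are hypothetical toy functions that reproduce the input and output shapes ($N \times D \rightarrow N \times D$ for each expert, $N \times D \rightarrow N \times K$ for the gate), not the CONV-MLP-CONV and CONV-BatchNorm-ReLU-CONV-Sigmoid networks of the paper. Only the weighted sum and the residual addition of Eq. (1) follow the text directly.

```python
import math
from typing import Callable, List

Tokens = List[List[float]]  # N tokens, each of dimension D


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def make_scaling_expert(factor: float) -> Callable[[Tokens], Tokens]:
    """Hypothetical stand-in for one expert network (N x D -> N x D)."""
    return lambda tokens: [[factor * v for v in row] for row in tokens]


def toy_gate(tokens: Tokens, num_experts: int) -> List[List[float]]:
    """Hypothetical stand-in for the gating network (N x D -> N x K).
    The final sigmoid keeps every attribute score in (0, 1)."""
    return [[sigmoid(sum(row) - t) for t in range(num_experts)] for row in tokens]


def assemble(expert_outputs: List[Tokens], scores: List[List[float]]) -> Tokens:
    """Weighted sum over experts: F[n][d] = sum_t W[n][t] * H_t[n][d]."""
    n_tokens, dim = len(scores), len(expert_outputs[0][0])
    fused = [[0.0] * dim for _ in range(n_tokens)]
    for t, h_t in enumerate(expert_outputs):
        for n in range(n_tokens):
            for d in range(dim):
                fused[n][d] += scores[n][t] * h_t[n][d]
    return fused


def add_residual(backbone_tokens: Tokens, emoe_tokens: Tokens) -> Tokens:
    """Element-wise sum of backbone tokens and eMoE tokens, as in Eq. (1)."""
    return [[b + p for b, p in zip(rb, rp)]
            for rb, rp in zip(backbone_tokens, emoe_tokens)]


tokens = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]                      # N = 2, D = 3
experts = [make_scaling_expert(f) for f in (1.0, 0.5, 2.0, -1.0)]  # K = 4
scores = toy_gate(tokens, num_experts=len(experts))                # N x K
fused = assemble([expert(tokens) for expert in experts], scores)   # N x D
# For brevity the input tokens stand in for the frozen backbone layer output.
enhanced = add_residual(tokens, fused)                             # N x D
```

Because the gate ends in a sigmoid rather than a softmax, the scores of different experts are independent, so several attributes (for example motion blur and occlusion) can be active for the same input.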

Figure 3: An illustration of our eMoE module. Here we take four expert branches as illustrations. The RGB and event tokens are fed into eMoE, which decouples the challenging attributes and generates attribute-specific representations under corresponding challenging conditions. Meanwhile, it is also responsible for dynamically weighing and assembling all the attribute-specific features to form a more discriminative representation for tracking.

III-B3 Contrastive Relation Modeling

Apart from the eMoE module, which helps increase the discriminative ability of features under complicated scenes, we propose a CRM module to increase the interaction between the search region and target template and to enhance target information. To achieve this, we assume that the features from the target template contain mainly target information, whereas the features from the search region include both target and background information. We first fuse the corresponding patch tokens of the two modalities into fused target template tokens and fused search region tokens to better build the relation across the two modalities. After fusion, we create positive pairs between features of the target template and target features of the search region, and negative pairs between features of the target template and background features of the search region, as shown in Fig. 4. This pulls the target-related tokens of the search region toward the target template tokens while pushing the background-related tokens of the search region away from them, thus improving the discriminability.

Figure 4: An illustration of our CRM module. The RGB and event tokens are first fused into fused search region feature tokens and target template feature tokens. We exploit the contrastive learning strategy to pull the target information in the search tokens toward the template feature tokens while pushing background information away from the template feature tokens. The final goal is to make the tracking features more discriminative and unambiguous.

III-C Optimization

The main body of our RGB-E tracker $F_{RGB-E}$ is initialized by the transformer-based tracking backbone. The only parameters $\theta$ that need to be updated are those in eMoE and CRM. The optimization process can be formulated as

$$\theta = \arg\min \frac{1}{|\mathcal{D}|} \sum \mathcal{L}(CRM(\mathcal{F}_{D}(\mathcal{T}_{RGB-E}^{L})), B_{gt}), \tag{2}$$

where $\mathcal{D}$ denotes the RGB-event data and $|\mathcal{D}|$ its size.

The overall objective function of our model includes the tracking loss $L_{tracking}$, the contrastive loss $L_{NCE}$, and the attribute loss $L_{attr}$. The tracking loss is the same as that of the transformer-based tracking backbone [22]:

$$L_{tracking} = L_{cls} + \lambda_{iou} L_{iou} + \lambda_{L_{1}} L_{1} \tag{3}$$

where $L_{cls}$ is the focal loss [30] for object classification, the IoU loss [31] $L_{iou}$ and the $L_{1}$ loss are exploited for bounding box regression, and $\lambda_{iou}$ and $\lambda_{L_{1}}$ are regularization parameters. For more details, we refer readers to [22]. Additionally, we take the InfoNCE loss [32] as the contrastive learning loss for the CRM module. Given the fused target template tokens $\mathcal{T}_{fused}^{z}$ from the backbone ViT encoder, we compute the similarity $S = [s^{1}, s^{2}, \ldots, s^{N_{x}}]$ between $\mathcal{T}_{fused}^{z}$ and the fused search region tokens $\mathcal{T}_{fused}^{x} = [t_{fused}^{x,1}, t_{fused}^{x,2}, \ldots, t_{fused}^{x,N_{x}}]$, where $s^{i} = sim(\mathcal{T}_{fused}^{z}, t_{fused}^{x,i}) / \tau$ and $\tau$ is the temperature parameter. Based on that, the search region tokens that contain information inside the ground-truth bounding box are selected as positives, with similarity score $s_{p}$; the remaining $N_{neg}$ tokens form the negative pairs, with similarity scores $[s_{n}^{k}]_{k=1}^{N_{neg}}$.
The contrastive learning loss can be formulated as:

$$L_{NCE} = -\log\left(\frac{e^{s_{p}}}{e^{s_{p}} + \sum_{k=1}^{N_{neg}} e^{s_{n}^{k}}}\right) \tag{4}$$
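
As a rough illustration of Eq. (4), the sketch below evaluates the InfoNCE term for a toy set of tokens. Several details are assumptions made only for this example and are not specified in the text: `sim` is taken to be cosine similarity, the template tokens are assumed to be pooled into a single vector beforehand, the temperature value `tau=0.1` is hypothetical, and when several search tokens are positive the loss is averaged over them.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity (assumed choice for sim(.,.))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(s_pos: float, s_neg: Sequence[float]) -> float:
    """-log(exp(s_p) / (exp(s_p) + sum_k exp(s_n^k))), computed stably
    by subtracting the maximum score before exponentiating."""
    m = max([s_pos, *s_neg])
    denom = math.exp(s_pos - m) + sum(math.exp(s - m) for s in s_neg)
    return -((s_pos - m) - math.log(denom))


def crm_loss(template_vec: Sequence[float],
             search_tokens: List[Sequence[float]],
             is_target: Sequence[bool],
             tau: float = 0.1) -> float:
    """Average InfoNCE over positive search tokens (averaging is an assumption).
    is_target[i] is True when search token i lies inside the ground-truth box."""
    scores = [cosine(template_vec, tok) / tau for tok in search_tokens]
    pos = [s for s, flag in zip(scores, is_target) if flag]
    neg = [s for s, flag in zip(scores, is_target) if not flag]
    if not pos:
        raise ValueError("at least one positive search token is required")
    return sum(info_nce(s, neg) for s in pos) / len(pos)


template = [1.0, 0.0]
search = [[1.0, 0.0], [0.0, 1.0]]
aligned = crm_loss(template, search, is_target=[True, False])
misaligned = crm_loss(template, search, is_target=[False, True])
print(aligned < misaligned)
```

The loss is small when the token labeled as target is the one most similar to the template and large otherwise, which is the pull and push behavior described for the CRM module.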

For the attribute loss $L_{attr}$, we utilize the $L_{1}$ term to measure the distance between the estimated attribute scores and the ground truth labels. It can be formulated as:

$$L_{attr} = \sum_{t} \| W^{l,t} - G_{t} \|_{l_{1}} \tag{5}$$

The total objective can be formulated as follows:

$$\mathcal{L} = L_{tracking} + \alpha L_{NCE} + \beta L_{attr} \tag{6}$$
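As a rough illustration of Eqs. (5) and (6), the sketch below treats each estimated attribute score $W^{l,t}$ and each label $G^{t}$ as a scalar, which is an assumption. The weights `alpha` and `beta` and all numeric inputs are hypothetical placeholders, since their values are not given here.

```python
def attribute_loss(scores, labels):
    """Eq. (5): sum over attributes t of |W^{l,t} - G^t| (L1 distance)."""
    return sum(abs(w - g) for w, g in zip(scores, labels))

def total_loss(l_tracking, l_nce, l_attr, alpha, beta):
    """Eq. (6): L = L_tracking + alpha * L_NCE + beta * L_attr."""
    return l_tracking + alpha * l_nce + beta * l_attr

# Hypothetical estimated scores for four attributes and a binary label vector.
scores = [0.9, 0.7, 0.2, 0.1]
labels = [1, 1, 0, 0]
l_attr = attribute_loss(scores, labels)  # approximately 0.7

# Hypothetical loss values and weights, for illustration only.
loss = total_loss(l_tracking=1.2, l_nce=0.4, l_attr=l_attr, alpha=0.5, beta=0.5)
```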
TABLE I: Experimental results on VisEvent dataset. The best results are shown in bold.

| Tracker | Ocean [33] | SiamCAR [34] | SiamRPN++ [35] | ATOM [36] | PrDiMP [37] | LTMU [38] | FENet [6] | AFNet [7] | OSTrack [22] | CEUTrack [4] | ViPT [10] | eMoE-Tracker (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SR | 23.26 | 34.49 | 33.66 | 31.34 | 37.39 | 37.05 | 44.2 | 44.5 | 53.4 | 55.58 | 59.2 | **61.3** |
| PR | 52.02 | 58.86 | 60.58 | 60.45 | 64.47 | 66.76 | 58.9 | 59.3 | 69.5 | 69.06 | 75.8 | **76.4** |
| NPR | 54.21 | 62.99 | 64.72 | 63.41 | 67.02 | 69.78 | 61.2 | 62.5 | 72.6 | 73.0 | 73.2 | **79.6** |

TABLE II: Experimental results on COESOT dataset. The best results are shown in bold.

| Tracker | MixFormer1k [20] | STARK-S50 [39] | PrDiMP50 [37] | PrDiMP18 [37] | ATOM [36] | SiamRPN [13] | AiATrack [40] | TrSiam [41] | OSTrack [22] | CEUTrack [4] | ViPT [10] | eMoE-Tracker (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SR | 56.0 | 55.7 | 57.9 | 56.7 | 55.0 | 53.5 | 59.0 | 59.7 | 59.0 | 62.7 | 65.3 | **67.1** |
| PR | 62.8 | 62.6 | 65.0 | 62.9 | 63.6 | 61.1 | 67.4 | 66.3 | 66.6 | 70.9 | 73.7 | **79.9** |
| NPR | 61.7 | 61.6 | 64.0 | 62.6 | 63.0 | 62.8 | 65.6 | 65.8 | 62.8 | 73.9 | 76.4 | **82.3** |

IV Experiment

Figure 5: Visualization of attention maps from the backbone network compared with our eMoE-Tracker. Four challenging conditions including illumination variance, motion blur, occlusion and scale variance are selected to reflect the effectiveness of environmental attributes disentanglement and feature assembling.

IV-A Experimental Settings

Our method is implemented in PyTorch and trained end-to-end on one NVIDIA A800 GPU. During training, we use a global batch size of 64 and train for 60 epochs, where each epoch processes $6 \times 10^{4}$ sample pairs. We employ the AdamW [42] optimizer with a weight decay of $10^{-4}$ and set the initial learning rate to $2 \times 10^{-4}$, which is decreased by a factor of 10 after 32 epochs.
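The learning-rate schedule described above can be sketched as a simple step function. This is only an illustration of the stated settings: the zero-based epoch indexing and the function name `learning_rate` are assumptions, and the actual training code may implement the drop differently.

```python
def learning_rate(epoch, base_lr=2e-4, drop_epoch=32, factor=10.0):
    """Step schedule: base_lr for the first drop_epoch epochs, then base_lr / factor.

    Epochs are assumed to be zero-based (epoch 0 is the first epoch).
    """
    return base_lr if epoch < drop_epoch else base_lr / factor

NUM_EPOCHS = 60
PAIRS_PER_EPOCH = 60_000

# Learning rate used in each of the 60 epochs.
schedule = [learning_rate(e) for e in range(NUM_EPOCHS)]

# Total number of sample pairs seen during training: 60 * 6e4 = 3.6e6.
total_pairs = NUM_EPOCHS * PAIRS_PER_EPOCH
```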

To demonstrate the effectiveness of our method, we use two RGB-E datasets: VisEvent [8] and COESOT [4]. Details of the datasets can be found in their papers. We compare with some RGB-E trackers, including ViPT [10], CEUTrack [4], FENet [6], AFNet [7] and many other RGB-based trackers with two-modal input, e.g., SiamRPN++ [35], ATOM [36], STARK [39], MixFormer [20], etc. We adopt three metrics to evaluate trackers’ performance, including precision rate (PR), success rate (SR), and normalized precision rate (NPR).
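The three metrics are not defined in this section. The sketch below follows the definitions commonly used in single-object tracking and is an assumption about the evaluation protocol, not a description of the benchmark toolkits: PR is taken as the fraction of frames whose center location error is within a pixel threshold, and SR as the fraction of frames whose IoU exceeds a threshold. The thresholds of 20 pixels and 0.5 are illustrative; toolkits typically sweep the thresholds and summarize the resulting curves, and NPR additionally normalizes the center error by the target size.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def center_error(box_a, box_b):
    """Euclidean distance in pixels between the two box centers."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))

def precision_rate(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

def success_rate(preds, gts, threshold=0.5):
    """Fraction of frames whose IoU exceeds `threshold`."""
    hits = sum(iou(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Two hypothetical frames: a perfect prediction and one that misses the target.
preds = [(0, 0, 10, 10), (30, 0, 10, 10)]
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
pr = precision_rate(preds, gts)  # center errors are 0 and 30 pixels
sr = success_rate(preds, gts)    # IoUs are 1.0 and 0.0
```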

Figure 6: Visualization of the score head maps from the backbone model, ViPT [10], and our eMoE-Tracker. The red, green, and yellow boxes denote the bounding boxes of the backbone network, ViPT [10], and eMoE-Tracker, respectively. (a) RGB search frame. (b) Stacked event search frame. (c) Score maps from the backbone network. (d) Score maps from ViPT. (e) Score maps from eMoE-Tracker.
Figure 7: Overall performance on COESOT under four challenging attributes, including illumination variation, motion blur, scale variance and full occlusion.

IV-B Comparison

Evaluation on VisEvent. We evaluate our method on the VisEvent dataset against SOTA trackers with two-modal inputs. Note that our model takes stacked event frames, rather than raw event streams, as input. The quantitative results are reported in TABLE I. Our method outperforms the other SOTA trackers, achieving 61.3%, 76.4%, and 79.6% on SR, PR, and NPR, respectively. Notably, it surpasses the backbone network by 7.9% and 6.9% on SR and PR and exceeds the existing RGB-E SOTA tracker, ViPT, by 2.1% and 0.6% on SR and PR, respectively, which demonstrates the effectiveness of our method.

Evaluation on COESOT. COESOT is the largest real-world visible-event benchmark dataset. We compare our method with the 11 other trackers listed in TABLE II. As observed, our method achieves the best performance among all the trackers, with 67.1%, 79.9%, and 82.3% on SR, PR, and NPR, respectively. In particular, eMoE-Tracker gains 1.8% on SR and 6.2% on PR over ViPT [10], and it also outperforms the backbone network by a large margin. These results demonstrate that our algorithm achieves SOTA performance on the COESOT dataset.
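The margins over the backbone and over ViPT can be recomputed directly from the SR and PR entries of TABLE I and TABLE II. The short check below uses the OSTrack [22] column as the backbone, since its numbers match the Backbone row of TABLE III; the dictionary layout and the helper name `gain` are only illustrative.

```python
# (SR, PR) pairs copied from TABLE I (VisEvent) and TABLE II (COESOT).
RESULTS = {
    "VisEvent": {
        "OSTrack": (53.4, 69.5),
        "ViPT": (59.2, 75.8),
        "eMoE-Tracker": (61.3, 76.4),
    },
    "COESOT": {
        "OSTrack": (59.0, 66.6),
        "ViPT": (65.3, 73.7),
        "eMoE-Tracker": (67.1, 79.9),
    },
}

def gain(dataset, baseline):
    """Absolute (SR, PR) improvement of eMoE-Tracker over a baseline, in points."""
    ours = RESULTS[dataset]["eMoE-Tracker"]
    base = RESULTS[dataset][baseline]
    return tuple(round(o - b, 1) for o, b in zip(ours, base))

visevent_vs_backbone = gain("VisEvent", "OSTrack")  # (7.9, 6.9)
visevent_vs_vipt = gain("VisEvent", "ViPT")         # (2.1, 0.6)
coesot_vs_backbone = gain("COESOT", "OSTrack")      # (8.1, 13.3)
coesot_vs_vipt = gain("COESOT", "ViPT")             # (1.8, 6.2)
```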

Visualization. Qualitative results are provided in Fig. 5 and Fig. 6. Specifically, Fig. 5 shows the attention maps from the backbone network and eMoE-Tracker, where our model can generate a more discriminative response under some complex scenarios, e.g., scale variance. In Fig. 6, a more precise location of the target can be provided by eMoE-Tracker compared with the backbone network and ViPT [10].

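TABLE III: Ablation studies on the effectiveness of our proposed modules: eMoE and CRM.

| Model | eMoE | CRM | VisEvent SR | VisEvent PR | COESOT SR | COESOT PR |
| --- | --- | --- | --- | --- | --- | --- |
| Backbone | | | 53.4 | 69.5 | 59.0 | 66.6 |
| ① | ✓ | | 59.2 | 75.8 | 65.8 | 75.0 |
| ② | ✓ | ✓ | 61.3 | 76.4 | 67.1 | 79.9 |

TABLE IV: Ablation studies on the number of experts.

| Number of experts | VisEvent SR | VisEvent PR | COESOT SR | COESOT PR |
| --- | --- | --- | --- | --- |
| 1 | 54.2 | 70.8 | 60.6 | 67.1 |
| 2 | 58.6 | 71.6 | 62.1 | 70.9 |
| 3 | 59.0 | 73.5 | 63.4 | 72.6 |
| 4 | 61.3 | 76.4 | 67.1 | 79.9 |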

IV-C Ablation Studies

Effectiveness of eMoE and CRM. To validate the effectiveness of our proposed modules, we perform an ablation study on the VisEvent and COESOT datasets. We run three comparison experiments: 1) the backbone network; 2) the backbone network with eMoE; 3) the backbone network with eMoE and CRM. The results are reported in TABLE III. As observed, the best performance is obtained when all the proposed modules are combined with the backbone network. On the VisEvent dataset, with the incorporation of eMoE and CRM, our method outperforms the backbone model by 7.9% and 6.9% on SR and PR, respectively. Additionally, adding the CRM module to model ① (the backbone with eMoE) improves SR and PR by 2.1% and 0.6%, respectively. The results showcase the effectiveness of the proposed modules.

Analysis on environmental attributes. The COESOT dataset annotates 17 challenging environmental attributes to help analyze performance under different challenging conditions. Here we report the overall performance on COESOT for the four disentangled attributes: illumination variation, motion blur, scale variance, and full occlusion. As shown in Fig. 7, our proposed eMoE-Tracker outperforms the backbone model and ViPT [10]. Specifically, it achieves 63.6% in occlusion, 77.9% in illumination variance, 73.5% in motion blur, and 80.7% in scale variance on PR. The results demonstrate that our algorithm is effective in improving tracking precision and robustness under various challenging scenarios.

Inserted layers of eMoE. We achieve RGB-E tracking under various challenging conditions by injecting visual prompt blocks into different layers of the backbone model, so it is natural to investigate the effect of the number of inserted blocks. We set the insertion interval to 1, 2, 4, 6, and 12. An interval of 1 means that blocks are inserted into every layer, while an interval of 12 inserts a block only into the last high-level layer. As TABLE V shows, performance drops steadily as the interval grows, and inserting blocks into every layer gives the best results.

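TABLE V: Ablation studies on inserted intervals into the backbone encoders.

| Inserted interval | VisEvent SR | VisEvent PR | COESOT SR | COESOT PR |
| --- | --- | --- | --- | --- |
| 1 | 61.3 | 76.4 | 67.1 | 79.9 |
| 2 | 60.0 | 75.8 | 66.3 | 76.1 |
| 4 | 58.1 | 72.6 | 63.4 | 72.5 |
| 6 | 55.8 | 72.0 | 61.1 | 70.9 |
| 12 | 54.7 | 70.3 | 60.3 | 68.0 |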
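To make the insertion intervals concrete, the sketch below lists which encoder layers would receive a block for each interval. It assumes a 12-layer encoder with 1-based layer indices and places a block at the end of each interval, so that an interval of 12 selects only the last layer as described above; the exact placement rule used in the implementation is not stated, so this is one plausible reading.

```python
def inserted_layers(interval, num_layers=12):
    """1-based indices of the encoder layers that receive a block."""
    return [layer for layer in range(1, num_layers + 1) if layer % interval == 0]

for interval in (1, 2, 4, 6, 12):
    print(interval, inserted_layers(interval))
# 1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
# 2 [2, 4, 6, 8, 10, 12]
# 4 [4, 8, 12]
# 6 [6, 12]
# 12 [12]
```

Under this reading, the number of inserted blocks falls from 12 to 1 as the interval grows from 1 to 12.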

Analysis on the number of experts. Because the environmental attributes are complicated, it is worthwhile to consider the impact of the number of experts on tracking performance. Intuitively, more experts are better at handling complex environments and decomposing them into environmental attributes for easier tracking. However, too many experts increase the burden on the model because of the additional parameters, and they may also lead to overfitting. Therefore, we conduct ablation studies on the number of experts, and the results are reported in TABLE IV. Performance improves consistently on both datasets as the number of experts increases from one to four.

Analysis on the model complexity. As mentioned previously, trackers with a two-stream structure, e.g., Siamese-based trackers, suffer from high model complexity due to the demands of multi-modal fusion. To evaluate the advantage of a one-stream tracker in network complexity, we use the computational cost, measured by the number of trainable parameters, to represent the model complexity. We count the trainable parameters of several two-stream trackers, namely AFNet [7], FENet [6], and VisEvent [8], and of our proposed eMoE-Tracker for comparison. Results are reported in TABLE VI.

TABLE VI: Model complexity comparison between two-stream trackers and our eMoE-Tracker. OS and TS denote one stream and two streams, respectively.

| Tracker | AFNet [7] | FENet [6] | VisEvent [8] | eMoE-Tracker |
| --- | --- | --- | --- | --- |
| Structure type | TS | TS | TS | OS |
| Trainable parameters (MB) | 25.16 | 41.87 | 27.53 | 8.42 |

V CONCLUSIONS

In this work, we proposed eMoE-Tracker, a one-stream transformer-based tracking model that introduces a mixture-of-experts structure and a contrastive learning scheme into RGB-E tracking under various challenging conditions. Extensive experiments on the benchmark visible-event datasets VisEvent and COESOT demonstrate the robustness and effectiveness of eMoE-Tracker for RGB-E tracking under challenging conditions such as motion blur and illumination variance. The results suggest that the degradation of tracking performance in challenging conditions can be alleviated by explicitly considering tracking tasks from the perspective of environmental attributes.

Limitations. Despite its superior performance for RGB-E tracking, our model depends heavily on manual disentanglement annotations of the environmental attributes, which restricts its generalization. Additionally, real-world scenarios are complicated, and it is hard to account for all environmental attributes manually. In the future, we expect to learn an agent that obtains the environmental attributes in a learnable manner for multi-modal tracking tasks.

References

  • [1] T. Ran, L. Yuan, and J. Zhang, “Scene perception based visual navigation of mobile robot in indoor environment,” ISA transactions, vol. 109, pp. 389–400, 2021.
  • [2] X. Dai, X. Yuan, and X. Wei, “Tirnet: Object detection in thermal infrared images for autonomous driving,” Applied Intelligence, vol. 51, no. 3, pp. 1244–1261, 2021.
  • [3] X. Zheng, Y. Liu, Y. Lu, T. Hua, T. Pan, W. Zhang, D. Tao, and L. Wang, “Deep learning for event-based vision: A comprehensive survey and benchmarks,” arXiv preprint arXiv:2302.08890, 2023.
  • [4] C. Tang, X. Wang, J. Huang, B. Jiang, L. Zhu, J. Zhang, Y. Wang, and Y. Tian, “Revisiting color-event based tracking: A unified network, dataset, and metric,” arXiv preprint arXiv:2211.11010, 2022.
  • [5] J. Zhang, K. Zhao, B. Dong, Y. Fu, Y. Wang, X. Yang, and B. Yin, “Multi-domain collaborative feature representation for robust visual object tracking,” The Visual Computer, vol. 37, no. 9, pp. 2671–2683, 2021.
  • [6] J. Zhang, X. Yang, Y. Fu, X. Wei, B. Yin, and B. Dong, “Object tracking by jointly exploiting frame and event domain,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13043–13052, 2021.
  • [7] J. Zhang, Y. Wang, W. Liu, M. Li, J. Bai, B. Yin, and X. Yang, “Frame-event alignment and fusion network for high frame rate tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9781–9790, 2023.
  • [8] X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y. Wang, Y. Tian, and F. Wu, “Visevent: Reliable object tracking via collaboration of frame and event flows,” IEEE Transactions on Cybernetics, 2023.
  • [9] Z. Zhu, J. Hou, and D. O. Wu, “Cross-modal orthogonal high-rank augmentation for rgb-event transformer-trackers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22045–22055, 2023.
  • [10] J. Zhu, S. Lai, X. Chen, D. Wang, and H. Lu, “Visual prompt multi-modal tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9516–9526, 2023.
  • [11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [12] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14, pp. 850–865, Springer, 2016.
  • [13] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visual tracking with siamese region proposal network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8971–8980, 2018.
  • [14] Y. Yu, Y. Xiong, W. Huang, and M. R. Scott, “Deformable siamese attention networks for visual object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6728–6737, 2020.
  • [15] Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji, “Siamese box adaptive network for visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6668–6677, 2020.
  • [16] Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu, “Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 12549–12556, 2020.
  • [17] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr, “Staple: Complementary learners for real-time tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [18] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, “Distractor-aware siamese networks for visual object tracking,” in Proceedings of the European conference on computer vision (ECCV), pp. 101–117, 2018.
  • [19] G. Wang, C. Luo, Z. Xiong, and W. Zeng, “Spm-tracker: Series-parallel matching for real-time visual object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3643–3652, 2019.
  • [20] Y. Cui, C. Jiang, L. Wang, and G. Wu, “Mixformer: End-to-end tracking with iterative mixed attention,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13608–13618, 2022.
  • [21] B. Chen, P. Li, L. Bai, L. Qiao, Q. Shen, B. Li, W. Gan, W. Wu, and W. Ouyang, “Backbone is all your need: A simplified architecture for visual object tracking,” in European Conference on Computer Vision, pp. 375–392, Springer, 2022.
  • [22] B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” in European conference on computer vision, pp. 341–357, Springer, 2022.
  • [23] J.-P. Lan, Z.-Q. Cheng, J.-Y. He, C. Li, B. Luo, X. Bao, W. Xiang, Y. Geng, and X. Xie, “Procontext: Exploring progressive context transformer for tracking,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, IEEE, 2023.
  • [24] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8126–8135, 2021.
  • [25] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza, “Asynchronous, photometric feature tracking using events and frames,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 750–765, 2018.
  • [26] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza, “Eklt: Asynchronous photometric feature tracking using events and frames,” International Journal of Computer Vision, vol. 128, no. 3, pp. 601–618, 2020.
  • [27] Z. Yang, Y. Wu, G. Wang, Y. Yang, G. Li, L. Deng, J. Zhu, and L. Shi, “Dashnet: A hybrid artificial and spiking neural network for high-speed object tracking,” arXiv preprint arXiv:1909.12942, 2019.
  • [28] J. Huang, S. Wang, M. Guo, and S. Chen, “Event-guided structured output tracking of fast-moving objects using a celex sensor,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2413–2417, 2018.
  • [29] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, et al., “Event-based vision: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 1, pp. 154–180, 2020.
  • [30] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European conference on computer vision (ECCV), pp. 734–750, 2018.
  • [31] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 658–666, 2019.
  • [32] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A survey on contrastive self-supervised learning,” Technologies, vol. 9, no. 1, p. 2, 2020.
  • [33] Z. Zhang, H. Peng, J. Fu, B. Li, and W. Hu, “Ocean: Object-aware anchor-free tracking,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pp. 771–787, Springer, 2020.
  • [34] D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen, “Siamcar: Siamese fully convolutional classification and regression for visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6269–6277, 2020.
  • [35] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “Siamrpn++: Evolution of siamese visual tracking with very deep networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4282–4291, 2019.
  • [36] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “Atom: Accurate tracking by overlap maximization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4660–4669, 2019.
  • [37] M. Danelljan, L. V. Gool, and R. Timofte, “Probabilistic regression for visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7183–7192, 2020.
  • [38] K. Dai, Y. Zhang, D. Wang, J. Li, H. Lu, and X. Yang, “High-performance long-term tracking with meta-updater,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6298–6307, 2020.
  • [39] B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “Learning spatio-temporal transformer for visual tracking,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 10448–10457, 2021.
  • [40] S. Gao, C. Zhou, C. Ma, X. Wang, and J. Yuan, “Aiatrack: Attention in attention for transformer visual tracking,” in European Conference on Computer Vision, pp. 146–164, Springer, 2022.
  • [41] N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: Exploiting temporal context for robust visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1571–1580, 2021.
  • [42] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

Appendix

VI Dataset

VI-A Datasets

We evaluate the effectiveness of our eMoE-Tracker through extensive experiments on two visible-event benchmark datasets: VisEvent [8] and COESOT [4].

VisEvent. The VisEvent dataset contains 820 video sequence pairs including 37,128 RGB frames in total, and the minimum, maximum, and average sequence lengths are 18, 6246, and 450 frames, respectively. The frame rate of the RGB videos is around 25 FPS. The training subset contains 500 video sequences, while the testing subset contains 320 video sequences. The VisEvent dataset defines 17 attributes, some of which reflect scenarios under different lighting conditions, such as LI (Low Illumination), OE (Over Exposure), and IV (Illumination Variation).

COESOT. The COESOT dataset is the largest benchmark dataset for RGB-event single object tracking. It comprises 1354 aligned video sequences captured by a DAVIS346 event camera; the training subset contains 827 videos and the testing subset contains 527 videos. Similar to the VisEvent dataset, 17 attributes are annotated to help evaluate the performance of trackers under diverse scenarios.

Attributes Annotation. In our eMoE-Tracker, we manually annotate each video sequence with a 4-digit vector according to its environmental conditions. For example, a video sequence labeled as [illumination variation, motion blur, scale variance, occlusion] = $[1,1,0,0]$ contains illumination variation and motion blur but no scale variance or occlusion. All the video sequences are labeled in this manner according to the RGB ones.
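The labeling scheme can be illustrated with a small helper that turns the set of attributes present in a sequence into the 4-digit vector, using the attribute order given above. The names `ATTRIBUTES` and `encode_attributes` are hypothetical and only serve this example.

```python
ATTRIBUTES = ("illumination variation", "motion blur", "scale variance", "occlusion")

def encode_attributes(present):
    """Return the multi-hot label vector for the attributes present in a sequence."""
    unknown = set(present) - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return [1 if name in present else 0 for name in ATTRIBUTES]

label = encode_attributes({"illumination variation", "motion blur"})
print(label)  # [1, 1, 0, 0]
```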

VII Details of Experiments

VII-A Implementation

The eMoE-Tracker is implemented in PyTorch and trained on one NVIDIA A800 GPU. The frozen ViT backbone structure is the same as the one in ViPT [10], and we pre-train the backbone ViT from scratch. We introduce four expert branches in this work, corresponding to illumination variation, motion blur, scale variance, and occlusion. All four experts have the same structure and are initialized from a truncated normal distribution.

VII-B Ablation Studies

The number of experts. We present the tracking results with 1, 2, 3, and 4 experts in TABLE IV. When the number of experts is less than four, the attributes are randomly selected from the four pre-defined attributes, and more combinations remain to be evaluated. Moreover, because the existing setting relies on manual annotations, we are unable to evaluate the model with more than four experts. This is a limitation of manual annotation for extension.

Model complexity. We report the trainable parameters of four trackers to compare their complexity. Our eMoE-Tracker has a one-stream structure, while the others are two-stream trackers. It should be clarified that a one-stream tracker with a transformer structure is expected to have more trainable parameters, e.g., CEUTrack [4] has 96MB of trainable parameters. However, our eMoE-Tracker achieves better performance with fewer trainable parameters among one-stream trackers, because the frozen ViT backbone reduces the number of trainable parameters.

VIII Additional Evaluation Results

VIII-A Visualization

Fig. 5 illustrates the attention maps from the last layer of the ViT encoder for the backbone network and our eMoE-Tracker. Here, we show more attention maps, from layers 7 to 12, in Fig. 8. Across these layers, the responses from our eMoE-Tracker are clearer in both shallow and deep layers.

Figure 8: Visualization of the attention maps of the 7th to 12th layers from the backbone network and eMoE-Tracker. The left two columns are the RGB and event search regions, and the rest are the attention maps from the two networks. The upper row is from the backbone network, while the lower one is from eMoE-Tracker.

VIII-B Attributes Performance

Since the VisEvent and COESOT datasets both provide 17 attributes for tracking performance evaluation, we use the Matlab toolkit to plot attribute-wise performance curves. We present the curves for 16 attributes, excluding the no-motion scenario, in Fig. 10 and Fig. 9, which show the superior performance of our method under all the environmental conditions compared to prior SOTA trackers.

Figure 9: Precision curves of attribute-wise performance on the COESOT dataset. The 16 attributes shown are Background Object Motion, Over Exposure, Illumination Variation, Rotation, Fast Motion, Aspect Ratio Change, Motion Blur, Background Clutter, Scale Variation, Viewpoint Change, Partial Occlusion, Out-of-View, Low Illumination, Full Occlusion, Deformation, and Camera Motion.
Figure 10: Precision curves of attribute-wise performance on the VisEvent dataset. The 16 attributes shown are Background Object Motion, Over Exposure, Illumination Variation, Rotation, Fast Motion, Aspect Ratio Change, Motion Blur, Background Clutter, Scale Variation, Viewpoint Change, Partial Occlusion, Out-of-View, Low Illumination, Full Occlusion, Deformation, and Camera Motion.

IX Conclusion

In the supplementary material, we provide more details on the datasets, experiments, and performance results for the evaluation. All the results show the effectiveness of our proposed eMoE-Tracker for RGB-event tracking. In the future, we expect to extend the environmental expert branches dynamically and to design an agent that detects the environmental conditions automatically rather than relying on manual annotation.