TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking

Pengcheng Shao Tianyang Xu Zhangyong Tang Linze Li Xiao-Jun Wu wu˙[email protected] Josef Kittler School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford GU2 7XH, UK

Abstract

There is currently strong interest in improving visual object tracking by augmenting the RGB modality with the output of a visual event camera that is particularly informative about the scene motion. However, existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models, which have been optimised for RGB-only tracking, without adapting them to the intrinsic characteristics of the event data. To address this problem, we propose an Event backbone (Pooler), designed to obtain a high-quality feature representation that is cognisant of the innate characteristics of the event data, namely its sparsity. In particular, Multi-Scale Pooling is introduced to capture all the motion feature trends within event data through the utilisation of diverse pooling kernel sizes. The association between the derived RGB and event representations is established by an innovative module performing adaptive Mutually Guided Fusion (MGF). Extensive experimental results show that our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets, namely VisEvent and COESOT, where the precision and success rates on COESOT are improved by 4.9% and 5.2%, respectively. Our code will be available at https://github.com/SSSpc333/TENet.

keywords:

RGB-E object tracking, multi-scale pooling, mutually-guided fusion


1 Introduction

Figure 1: Comparisons of the proposed Pooler with CNN and Transformer methods. The CNN-based method uses parallel branches to extract the template and search features. The Transformer-based method utilises a unified attention block to perform relation modelling of the template and search tokens. Our pipeline takes into account the sparsity property of event images. “Event-only” and “RGB-only” signify that the input contains only one specific modality.

Visual object tracking aims to detect and locate the target, specified in the initial frame of a video, in the subsequent video frames. It supports a wide range of applications, such as autonomous driving, surveillance, and UAV navigation [1]. The existing technology based on RGB images is impressive in good imaging conditions. However, RGB images can be degraded by, for instance, limited illumination or over-exposure. To address these issues, the tracking community has been investigating the merits of event data [2, 3, 4], which can heighten the awareness of changing light intensity, in the context of robust visual tracking. Although event data offers great advantages in perceiving motion, it is unable to capture the visual appearance of the target object, such as colour and texture information. Therefore, visual object tracking that combines RGB and event data has been gaining increasing attention.

Unfortunately, the inherent disparities between the RGB and event data modalities pose notable challenges in devising effective strategies for the complementary use of the two modalities. The two primary challenges are: (1) The effective extraction of robust event features. The discrepancies between RGB and event sensors pull their distributions to different domains. However, existing trackers process the sparse event data using tools borrowed from RGB data analysis, yielding a sub-optimal solution. (2) How to harness the advantages of both modalities. Notably, an advanced tracker, ViPT [5], introduces a strategy in which a pre-trained RGB model is fine-tuned [6] to learn event-related prompts [7, 8], resulting in commendable performance. However, it still employs a simple addition operator to aggregate the features from both modalities, failing to achieve effective fusion of the information from the two modalities.

To address the aforementioned challenges, a novel end-to-end RGB-E tracking network is proposed. By taking into account the sparsity of event data, we can extract high-quality event features through a dedicated design of an event feature extraction backbone. The complementary information conveyed by the two modalities can be fused effectively by each modality focusing on the data it is competent to extract. Specifically, (1) we introduce a novel Pooling-based Event backbone that prioritises the position and shape of the moving object. Owing to the sparsity of event images, some regions have high pixel values, while other regions have zero pixel values. The pooling operation is conceived to concentrate on these sparse pixel values as a way to retain the crucial information. As shown in Figure 1(d), when the input is just the event modality, our Pooler outperforms the conventional CNN and Transformer solutions, which are typically employed in current practice. In contrast, on the data captured by the RGB modality, the performance of our Pooler falls short of the two established solutions. These experiments support the basic premise of our approach to handling event data: our Event-specific Pooler excels in extracting and capturing high-quality event features, setting it apart from traditional RGB backbones. (2) To enable effective information fusion of both modalities, we introduce a Mutually-Guided Fusion module (MGF). Our MGF allows each modality to inject the relevant information from the other source using a cross-attention mechanism, thereby improving their feature representation. The results of extensive experiments conducted on various RGB-E datasets validate our proposed TENet, which significantly outperforms existing state-of-the-art RGB-E tracking methods.

In conclusion, our contributions can be summarised as follows:

1. A novel, lightweight Pooling-based event feature extraction backbone that incorporates a multi-scale pooling operation to extract informative event features. The dedicated Pooler is instrumental in preserving motion clues and target contour, while ignoring the target appearance.

2. A cross-attention based Mutually-Guided Fusion module, which enables both modalities to concentrate on features that are prominent in their respective sensing domains, and fuses them effectively.

3. An extensive experimental validation demonstrates that our proposed TENet surpasses the performance of state-of-the-art trackers on the VisEvent and COESOT datasets, confirming the merit of the proposed fusion of the appearance and motion information.

2 Related Works

2.1 Visual Tracking Frameworks

Recently, visual object tracking [9, 10] has made tremendous progress thanks to the relentless advances in deep learning [11]. The online discriminative trackers [12, 13] learn a classifier to separate the object from the background. A milestone in tracking was the adoption of Siamese-based trackers [14, 15, 16], which incorporate a region proposal network into the Siamese architecture and its design variations, aiming to enhance the features used for tracking. The Transformer-based trackers [17, 18, 19] employ the attention mechanism [20] to capture long-term relational dependency between the template and the search region, promoting better accuracy and efficiency in long-term visual object tracking.

While advanced models exhibit excellent tracking performance, a common characteristic among most models is their reliance on comparing templates with semantically similar search regions. This process of matching is inevitably affected by the instantaneous imaging characteristics of the input pairs. In specific scenarios, such as overexposure and high-speed motion, where image quality is compromised, it significantly degrades the model performance. Consequently, researchers are keen to integrate multi-modal [21, 22, 23] inputs to mitigate the limitations of a single imaging modality. The incorporation of the event modality in object tracking offers the means of providing complementary information that should enhance the model’s adaptability to complex real-world scenarios.

2.2 RGB-E Tracking Solutions

RGB-E tracking [24, 25] has become increasingly popular due to the superiority of event data in perceiving motion, providing high-precision dynamic timestamps. This multimodal tracking proves highly effective in dealing with fast-moving objects, with notable resilience to challenging illumination conditions. To benefit from the complementary sources of information, Zhang et al. [26] introduced a cross-domain feature integrator that fuses feature information from the two domains, utilising a cross-domain attention mechanism. They also developed a voxel-based event pre-processing approach and constructed an extensive event-based dataset. Following the fusion study, Wang et al. [27] presented a cross-modal Transformer module designed to integrate event and RGB data. To exploit the highly dynamic nature of event data, AFNet [28] harnessed the elevated temporal resolution of event data to achieve high frame rate tracking. The model incorporates a cross-modal style alignment module, a cross-frame rate alignment module, and a cross-correlation fusion structure, facilitating a comprehensive RGB-E fusion. Zhu et al. [29] employed the concept of the Masked AutoEncoder (MAE [30]) to selectively mask RGB tokens and event tokens, enhancing interactions between cross-modal tokens. Additionally, they incorporated orthogonal high-rank regularisation to mitigate the network fluctuations induced by the masking process. To enable non-RGB to RGB transfer learning, ViPT [5] presents a comprehensive framework for multi-modal tracking, incorporating spatial attention to model interactions between RGB tokens and non-RGB prompts. This methodology fine-tunes a pre-trained base model to improve performance. Their approach demonstrates the considerable potential of leveraging diverse visual cue learning in the realm of multi-modal tracking.

From the above overview it is apparent that existing studies employ a homogeneous shared-weight backbone for the extraction of features by both the event and RGB modalities. This overlooks the divergent characteristics of the event and RGB data that are evident from the nature of their acquisition processes: event data, captured by an event camera, records asynchronous brightness changes within the scene at an irregular frame rate, while RGB data, obtained through an RGB camera, captures images at a consistent and fixed frame rate. We argue, therefore, that it is crucial to design separate, modality specific processing for the event modality to effectively exploit its unique properties. As the process of fusion adopted by the above methods is handcrafted, it will also be necessary to revisit the methodology of fusing the two modalities to ensure that the complementary characteristics of these modalities are properly integrated.

Typically, the event modality exhibits sparsity, with only a limited number of events being detected from time to time. To reflect these characteristics, we design a novel Pooling-based architecture, devising an extraction backbone for event features that captures the motion information from adjacent regions. On the multi-modal fusion side, we integrate a cross-attention mechanism, enabling each modality to selectively focus on features relevant to its specific domain. We shall demonstrate that this design facilitates a mutual reinforcement of domain-specific features. As the attention mechanism within a Transformer framework excels at dynamically capturing dependencies across diverse modalities, it is our first choice for amalgamating the multi-modal information.

3 The Proposed Method

3.1 Network Overview

The overall architecture of our TENet is shown in Figure 2. Our network contains four components: a two-modality asymmetric backbone, the Mutually-Guided Fusion (MGF) module, the relation modelling module, and the tracking head. Given the inputs, we first utilise the Transformer backbone to extract RGB features and employ our Pooler to extract event features. Subsequently, we apply the MGF to promote mutual reinforcement of the features from both modalities. The features from the template and the search region of the two modalities are progressively fused. Next, the fused template features and search region features are concatenated to facilitate the relation modelling. Finally, the acquired tokens are input to the tracking head for object localisation.

3.2 Pooling-based event feature extraction backbone

Our Pooler is designed to extract high-quality event features for the subsequent fusion of the two modalities. Our design takes into account the properties of event data, which differ from RGB data in several respects. Notable distinctions are its sparsity and its heightened sensitivity to motion. These properties motivate the use of a Pooling operation for extracting the motion information from event data, serving as a valuable complement to the target appearance provided by the RGB data. Given that some pixels in sparse event images are of high value, the Pooling operation is able to focus on such regions, while averaging out the noise and insignificant changes. Moreover, by adopting Multi-Scale Pooling, our feature aggregation captures different feature scales. The multi-level approach enhances the robustness of representations for sparse features by capturing fine-grained details at smaller scales and encapsulating abstract information at larger scales, thereby enabling effective handling of various feature distributions. Traditional convolution operations are dedicated to extracting detailed semantic features of the object, whereas event-based modalities concentrate on capturing the dynamic aspects of object movement. Therefore, our Pooler is more committed to a clear depiction of motion.

Our Pooler is divided into three stages. Event data often includes numerous events, but not all are task-relevant. In fact, there is a significant number of noise events. The first stage begins by applying MaxPooling to the event data in order to identify crucial events and suppress noise events, so as to capture salient feature representations. Subsequently, we employ AvgPooling to capture low-level information within the event data. In the final step, the gathered low-level and high-level event features are merged to create a more comprehensive event representation. The event features $F_{E_{1}}$ obtained in Stage 1 can be expressed as follows:

$$
\left\{\begin{aligned}
&F_{E_{f}}=\mathrm{M}(\varphi_{1\times 1}(E)),\\
&F_{E_{i}}=\mathrm{M}(\varphi_{1\times 1}(F_{E_{f}})),\\
&F_{E_{j}}=\mathrm{A}(\varphi_{1\times 1}(F_{E_{f}})),\\
&F_{E_{1}}=F_{E_{i}}+F_{E_{j}},
\end{aligned}\right.
\tag{1}
$$

where $E$ represents the original event images. $F_{E_{f}}$ , $F_{E_{i}}$ , and $F_{E_{j}}$ correspond to the features obtained during Stage 1. ${\varphi}_{1\times 1}$ represents a ${1\times 1}$ convolutional layer. M represents MaxPooling operation and A represents AvgPooling operation.
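The data flow of Eq. (1) can be sketched on a single-channel toy event image. This is only an illustration under stated assumptions: the kernel size (3), the stride (1), the clipped-border handling, and the scalar stand-in `conv1x1` for the ${1\times 1}$ convolution are all hypothetical choices, since the text does not specify them.

```python
def pool2d(x, k, reduce):
    """Stride-1 pooling over k x k windows; windows are clipped at the image border (assumption)."""
    h, w = len(x), len(x[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [x[a][b]
                      for a in range(max(0, i - r), min(h, i + r + 1))
                      for b in range(max(0, j - r), min(w, j + r + 1))]
            row.append(reduce(window))
        out.append(row)
    return out


def max_pool(x, k=3):
    return pool2d(x, k, max)


def avg_pool(x, k=3):
    return pool2d(x, k, lambda win: sum(win) / len(win))


def conv1x1(x, weight=1.0):
    # On a single channel a 1x1 convolution reduces to a scalar multiplication.
    return [[weight * v for v in row] for row in x]


def stage1(event):
    f_ef = max_pool(conv1x1(event))   # F_Ef = M(phi(E))
    f_ei = max_pool(conv1x1(f_ef))    # F_Ei = M(phi(F_Ef))
    f_ej = avg_pool(conv1x1(f_ef))    # F_Ej = A(phi(F_Ef))
    # F_E1 = F_Ei + F_Ej (element-wise sum of the two branches)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(f_ei, f_ej)]


# A sparse 5x5 event image with a single event at the centre.
event = [[0.0] * 5 for _ in range(5)]
event[2][2] = 1.0
stage1_out = stage1(event)
```

In this toy case the max-pooling branch spreads the single event over the whole image, while the average-pooling branch keeps the response strongest around the event location, so the sum peaks at the centre.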

The event camera captures events with a microsecond-level latency, providing real-time responses to rapid changes in high-speed scenarios and tracking the changes in object position and location. To effectively handle objects appearing at different positions and moving with distinct speeds, we introduce Multi-Scale Pooling (MSP) operations in Stage 2 and Stage 3. MSP adjusts the scale of the analysis based on the variations in object velocity to capture the changes in the object more accurately. This module utilises pooling operations with different kernel sizes, allowing the network to achieve a range of feature representations that comprehensively capture motion, from coarse to fine-grained detail.

More specifically, in Stage 2, as depicted in Figure 3, we partition the features produced in Stage 1 into four groups based on their channel dimensions. The first group is initialised with a ${3\times 3}$ kernel, while the kernel size for each subsequent group is increased by 2. This approach permits each group to possess a unique receptive field, thereby effectively capturing changes in both the object and the background. Subsequently, as each group contains information of varying granularity, we apply a ${1\times 1}$ convolution to all groups to aggregate the extracted information across groups, promoting global information cross-fertilisation. The feature $F_{E_{2}}$ obtained in Stage 2 can be expressed as follows:

$$
F_{E_{i}}=\varphi_{1\times 1}(\varphi_{2\times 2}(F_{E_{1}})),
\tag{2}
$$

$$
\begin{aligned}
F_{E_{j}}=\mathrm{Concat}\big(&M_{k_{1}\times k_{1}}(x_{1}),\,M_{k_{2}\times k_{2}}(x_{2}),\\
&M_{k_{3}\times k_{3}}(x_{3}+G_{2}),\,\ldots,\,M_{k_{n}\times k_{n}}(x_{n}+G_{n-1})\big),
\end{aligned}
\tag{3}
$$

$$
F_{E_{2}}=\varphi_{1\times 1}(F_{E_{j}})+F_{E_{i}},
\tag{4}
$$

where $F_{E_{i}}$ and $F_{E_{j}}$ denote the features obtained in Stage 2. $x=\{x_{1},x_{2},...,x_{n}\}$ refers to the features of each group after partitioning along the channel dimension into $n$ groups. $k_{i}\in\{3,5,...,K\}$ denotes the kernel size of the $i$ -th group, which increases by 2 from one group to the next. $G_{n}$ signifies the output of the $n$ -th group after the MaxPooling operation.
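A minimal sketch of the grouped pooling in Eq. (3) is given below, with each group reduced to a single 2-D channel. It follows the equation as printed, so the output of the previous group is added to the input only from the third group onwards. The function names, the stride of 1, and the clipped-border handling are assumptions made for illustration.

```python
def max_pool(x, k):
    """Stride-1 max pooling with a k x k window clipped at the border (assumption)."""
    h, w = len(x), len(x[0])
    r = k // 2
    return [[max(x[a][b]
                 for a in range(max(0, i - r), min(h, i + r + 1))
                 for b in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]


def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def multi_scale_pool(groups, first_kernel=3):
    """Apply max pooling with kernel sizes 3, 5, 7, ... to successive channel groups."""
    outputs, kernels = [], []
    for idx, x in enumerate(groups):
        k = first_kernel + 2 * idx          # kernel size grows by 2 per group
        if idx >= 2:                        # Eq. (3) as printed: x_3 + G_2, ..., x_n + G_{n-1}
            x = add(x, outputs[idx - 1])
        outputs.append(max_pool(x, k))
        kernels.append(k)
    # The list of per-group outputs stands in for concatenation along the channel axis.
    return outputs, kernels


def single_event(size=9):
    img = [[0.0] * size for _ in range(size)]
    img[size // 2][size // 2] = 1.0
    return img


# Four channel groups, each holding one event at the centre of a 9x9 map.
msp_out, msp_kernels = multi_scale_pool([single_event() for _ in range(4)])
```

The small kernels keep the response localised around the event, whereas the larger kernels, fed with the previous group's output, produce a coarser and stronger response over a wider area.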

For intuitive visualisation, we select two representative scenes to validate the effectiveness of our Pooler. As shown in Figure 4(a) and Figure 4(d), four different scales of pooling operations are utilised to focus on objects in event images in both Scene 1 and Scene 2. Groups 1 and 2 in both scenes are capable of concentrating on fine-grained objects and the surrounding environment. Groups 3 and 4 experience a certain degree of feature loss as a result of the large kernel used in the pooling operation, but this helps to capture coarse features that are spatially imprecise. The features extracted at each scale are then aggregated to acquire a comprehensive representation. In Figure 4(b) and Figure 4(e), the variations in objects at various scales, along with background elements, are aggregated to make it possible to segregate the object from the background. As shown for both Stage 2 and Stage 3 in Figure 4(c) and Figure 4(f), our MSP successfully differentiates the object from the background, thereby effectively extracting event features.

3.3 Mutually-Guided Fusion module

In general, the fundamental task of RGB-E tracking lies in the synergistic fusion of event motion cues and RGB appearance information. Among existing approaches, ViPT [5] applies a linear mapping of the two-modality data to tokens and subsequently employs a basic addition fusion. However, this straightforward fusion method falls short of unlocking the complete potential of event and RGB data, overlooking their unique advantages. Other methods design the model structure manually, taking the specific event data characteristics into account, but this is likely to be less effective than a solution learnt by a purposeful fusion architecture.

To enhance the utilisation of event data, we introduce a mutually guided cross-modal fusion module into our model, namely MGF. This module leverages cross-attention mechanisms, enabling the salient features of one modality to integrate with those of the other, thereby strengthening each modality and fostering a more cohesive inter-modal collaboration. To reduce the memory footprint of these attention operations, we downsample the width and height of the keys and values, so that the number of key and value tokens is reduced by a factor of $k$ . Specifically, the attention map obtained in the original Transformer [20] according to (5) is $hw\times hw$ . Following the downsampling operation, the dimensions of the attention map are altered to $hw\times(hw/k)$ . Taking input (ii) in Figure 5 as an example, we show the process of enhancing the RGB modality with the event modality. The input features from the two branches are initially subjected to layer normalisation, followed by applying the cross-attention mechanism to process the tokens from both modalities. In this process, the vectorised features of the event modality are used as queries $Q_{Event}$ and the vectorised features of the RGB modality are used as keys $K_{RGB}$ and values $V_{RGB}$ , respectively. This process can be represented by the following equation:

$$
\mathrm{Attention}(Q_{Event},K_{RGB},V_{RGB})=\mathrm{softmax}\left(\frac{Q_{Event}(K_{RGB})^{T}}{\sqrt{d_{k}}}\right)V_{RGB},
\tag{5}
$$

where $d_{k}$ is the dimension of the keys ( $K_{RGB}$ ). This formula describes how the query tokens of the event modality attend to the key tokens of the RGB modality in order to capture the associations between the two modalities; the resulting attention weights are then applied to the RGB value tokens. Finally, the original RGB tokens are added to the output of the attention operation. The combined result is then passed on to the MLP for further processing. Enhancing the event modality with the RGB modality follows a similar process, with the key difference being that the roles are swapped, namely the vectorised features of the RGB modality are used as queries ( $Q_{RGB}$ ) and the vectorised features of the event modality are used as keys ( $K_{Event}$ ) and values ( $V_{Event}$ ).
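The cross-attention step with downsampled keys and values can be sketched as follows. This is a simplified illustration, not the authors' implementation: layer normalisation and the MLP are omitted, the scores are scaled by $\sqrt{d_{k}}$ as in the original Transformer [20], and the helper `downsample`, which averages every `k` consecutive tokens, is a hypothetical 1-D stand-in for the spatial downsampling described in the text.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def downsample(tokens, k):
    """Average every k consecutive tokens (hypothetical stand-in for spatial downsampling)."""
    return [[sum(col) / len(col) for col in zip(*tokens[i:i + k])]
            for i in range(0, len(tokens), k)]


def cross_attention(queries, keys_values, k=4):
    kv = downsample(keys_values, k)                      # hw tokens -> hw / k tokens
    d = len(queries[0])
    scores = [[sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in kv]
              for q in queries]
    attn = [softmax(row) for row in scores]              # shape: hw x (hw / k)
    out = [[sum(w * v[c] for w, v in zip(row, kv)) for c in range(d)]
           for row in attn]
    return attn, out


def guided_fusion(event_tokens, rgb_tokens, k=4):
    """Event tokens act as queries; RGB tokens provide keys and values (input (ii))."""
    attn, out = cross_attention(event_tokens, rgb_tokens, k)
    # Residual connection: the original RGB tokens are added to the attention output.
    fused = [[r + o for r, o in zip(rt, ot)] for rt, ot in zip(rgb_tokens, out)]
    return attn, fused


# Toy example with hw = 16 tokens of dimension 2.
event_tokens = [[float(i), 1.0] for i in range(16)]
rgb_tokens = [[1.0, float(i % 4)] for i in range(16)]
attn_map, fused_rgb = guided_fusion(event_tokens, rgb_tokens, k=4)
```

With 16 query tokens and a downsampling rate of 4, the attention map has 16 rows and 4 columns, which mirrors the reduction from $hw\times hw$ to $hw\times(hw/k)$ . Swapping the two arguments of `guided_fusion` gives the opposite direction, in which RGB tokens act as queries.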

3.4 Relationship modelling for fusing template and search

Following the fusion of features from the RGB and event modalities, the template tokens of the two modalities are summed, as are the search region tokens. This operation integrates the information of the two modalities efficiently while keeping the template and the search region separate. Prior to modelling the feature relationships, we concatenate the template tokens and search region tokens along the spatial dimension to create a unified representation. Subsequently, we use four layers of the standard ViT [31] architecture to conduct relationship modelling on the combined tokens. At each layer, we utilise a Multi-Head Self-Attention (MSA) module to calculate self-attention among the combined tokens. Since the template and search region tokens are concatenated, this self-attention also captures the interrelationships between them. Finally, a split operation is applied to separate the search region tokens for the purpose of object localisation. The overall process of relationship modelling can be depicted as follows:

$$
\left\{\begin{aligned}
&F_{T}=fu_{E\_T}+fu_{R\_T},\\
&F_{S}=fu_{E\_S}+fu_{R\_S},\\
&f_{l}^{0}=\mathrm{Concat}(F_{T},F_{S}),\\
&f_{l}^{i^{\prime}}=f_{l}^{i}+\mathrm{MSA}(\mathrm{LN}(f_{l}^{i})),\\
&f_{l}^{i+1}=f_{l}^{i^{\prime}}+\mathrm{MLP}(\mathrm{LN}(f_{l}^{i^{\prime}})),\\
&final_{S}^{I}=\mathrm{DeConcat}(f_{l}^{I}),
\end{aligned}\right.
\tag{6}
$$

where the superscript $i$ denotes the $i$ -th layer and $I$ the last layer.
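The control flow of Eq. (6) can be sketched as below. The attention and MLP sub-layers are passed in as callables (`msa` and `mlp` are placeholder names, not the authors' modules), so the sketch only shows the summation, concatenation, pre-norm residual layers, and the final split; the number of layers defaults to the four mentioned in the text.

```python
def layer_norm(tokens, eps=1e-6):
    """Normalise each token vector to zero mean and unit variance."""
    out = []
    for t in tokens:
        mean = sum(t) / len(t)
        var = sum((v - mean) ** 2 for v in t) / len(t)
        out.append([(v - mean) / (var + eps) ** 0.5 for v in t])
    return out


def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def relation_modelling(fu_e_t, fu_r_t, fu_e_s, fu_r_s, msa, mlp, num_layers=4):
    f_t = add(fu_e_t, fu_r_t)            # F_T: summed template tokens
    f_s = add(fu_e_s, fu_r_s)            # F_S: summed search region tokens
    f = f_t + f_s                        # Concat along the token axis
    for _ in range(num_layers):
        f = add(f, msa(layer_norm(f)))   # f' = f + MSA(LN(f))
        f = add(f, mlp(layer_norm(f)))   # f  = f' + MLP(LN(f'))
    return f[len(f_t):]                  # DeConcat: keep the search region tokens


# Placeholder sub-layers that return zeros, so only the residual path remains.
def zero_layer(tokens):
    return [[0.0 for _ in t] for t in tokens]


template_tokens = [[1.0, 2.0], [3.0, 4.0]]
search_event = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
search_rgb = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
search_out = relation_modelling(template_tokens, template_tokens,
                                search_event, search_rgb,
                                msa=zero_layer, mlp=zero_layer)
```

With the zero-valued placeholders, the output equals the summed search tokens, which makes the residual structure and the split easy to verify before real attention and MLP layers are plugged in.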

4 Evaluation

4.1 Implementation details.

In our approach, we employ the feature extraction backbone for the RGB modality [17], which has been pre-trained on tracking datasets. Our implementation is carried out using Python 3.7 and PyTorch 1.9.0. The network is trained over 60 epochs, utilising the AdamW optimiser with default settings. During training, we use a batch size of 32, and the initial learning rate is set to 0.0001. The network training is conducted on a single RTX 3090 GPU.

4.2 Comparison with the state-of-the-art Trackers

We evaluate the performance of our network on two RGB-E benchmarks, namely VisEvent and COESOT. For evaluation, we use precision rate (PR) and success rate (SR) as the measurement metrics. It is worth noting that both datasets were recorded using a DAVIS 346 camera with a resolution of 346 × 260 pixels. Importantly, we exclusively use event images derived from the original event data and do not utilise event streams.
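For readers unfamiliar with the two metrics, the sketch below follows the conventions commonly used in single-object tracking: PR as the fraction of frames whose centre location error is within a pixel threshold, and SR as the area under the overlap success curve. The 20-pixel threshold and the 21 overlap thresholds are assumptions taken from common practice, not values stated in this paper, and the benchmark toolkits may differ in detail.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def centre_error(a, b):
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def precision_rate(preds, gts, threshold=20.0):
    """Fraction of frames with centre error within the threshold (assumed 20 px)."""
    hits = sum(centre_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def success_rate(preds, gts, steps=21):
    """Area under the success curve over evenly spaced overlap thresholds in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    curve = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(curve) / len(curve)


# Two frames: one perfect prediction and one complete miss.
gt_boxes = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
pred_boxes = [(0.0, 0.0, 10.0, 10.0), (100.0, 100.0, 10.0, 10.0)]
pr = precision_rate(pred_boxes, gt_boxes)
sr = success_rate(pred_boxes, gt_boxes)
```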

Results on VisEvent. VisEvent is one of the most widely used RGB-E tracking datasets, comprising 820 video pairs captured in environments characterised by low illumination, high dynamics, and background clutter. The dataset is partitioned into 500 training subsets and 320 test subsets. As shown in Figure 6, in comparison to the state-of-the-art tracker ViPT [5], our method improves the precision rate (PR) and success rate (SR) by 0.7% and 0.9% to reach 76.5% and 60.1%, respectively. Notably, our approach achieves an inference speed that is approximately 40% faster than ViPT. Both the experimental results and the improved inference speed underscore the effectiveness and efficiency of our method.

Results on COESOT. COESOT is a generic single object tracking dataset designed for colour event cameras, containing 1354 colour event videos and 478,721 RGB frames. The dataset is divided into 827 training subsets and 527 testing subsets. Table 1 reports the results of a comprehensive evaluation of our method on the COESOT dataset, showing a commendable performance in both precision and success rate. Specifically, our proposed method attains a precision rate (PR) of 76.8% and a success rate (SR) of 68.4%, surpassing the state-of-the-art method, HRCEUTrack [29], by 4.9% and 5.2%, respectively. This represents a substantial performance enhancement in the event tracking task.

Table 1: An overall comparison on the COESOT benchmark. We adopt four commonly used metrics for the comparison, i.e., PR, SR, normalised precision rate (NPR), and breakOut capability score (BOC). “EI” stands for event images. “EVox” represents the voxel form of the raw event data.

Models	Modality	SR	PR	NPR	BOC
Mixformer1k [32]	RGB+EI	56.0	62.8	61.7	17.2
PrDiMP50 [33]	RGB+EI	57.9	65.0	64.0	17.5
KYS [34]	RGB+EI	58.6	66.7	65.7	17.9
DiMP50 [35]	RGB+EI	58.9	67.1	65.9	18.1
AiATrack [36]	RGB+EI	59.0	67.4	65.6	19.0
KeepTrack [37]	RGB+EI	59.6	66.1	65.1	18.1
TOMP50 [38]	RGB+EI	59.8	66.7	65.7	18.3
TrDiMP [39]	RGB+EI	60.1	66.9	65.8	18.4
SuperDiMP [40]	RGB+EI	60.2	67.0	66.0	18.5
TransT50 [41]	RGB+EI	60.5	67.9	66.6	18.5
SiamR-CNN [42]	RGB+EI	60.9	67.5	66.3	19.1
CEUTrack [43]	RGB+EVox	62.0	70.5	69.0	20.8
HRCEUTrack [29]	RGB+EVox	63.2	71.9	70.2	21.6
TENet (Ours)	RGB+EI	68.4	76.8	75.3	24.2

4.3 Discussion of the experiments evaluating the “Pooler”

Based on the sparsity and discreteness of the data conveyed by the event frames, we propose a lightweight multi-scale pooling mechanism, termed Pooler, which captures object motion information across different scales. In order to illustrate the advantages of Pooler over existing lightweight backbones and traditional multi-scale pooling methods, we compare its features with the output features of MobileNet [44] and SPP [45] (Spatial Pyramid Pooling). The configuration of the MobileNet and SPP competitors follows Figure 2: in these variants, the Pooler is replaced by MobileNetV3, or Stage 2 and Stage 3 of the Pooler are substituted by SPP. As shown in Table 2, overall, TENet delivers superior performance compared to MobileNetV3 and SPP. On VisEvent, our approach outperforms MobileNetV3 by 1.5% and 1.4% in PR and SR, and surpasses SPP by 2.4% and 2.0%. Similarly, on COESOT, our method is better than MobileNetV3 by 1.1% and 0.9%, and exceeds SPP by 1.5% and 1.3%.

To gain an intuitive understanding of the strength of the Pooler, we visualise the feature maps output by MobileNetV3, SPP and Pooler for sample scenes in Figure 7. From the analysis of Scene 1, Scene 2, and Scene 4, it is evident that our Pooler is able to distinguish the foreground from the background effectively. In Scene 1, even when the object is partially obscured, our Pooler highlights the unobscured parts of the object successfully, showcasing its robustness in complex occlusion scenarios. In both Scene 2 and Scene 4, our Pooler again accurately delineates the foreground from the background. In Scene 3, the Pooler highlights the object outline, facilitating a better targeted object feature extraction. This enhances the machine perception of the object shape and spatial structure. In contrast, the lightweight MobileNet struggles to distinguish the foreground from the background clearly when handling background interference, and it is challenging to discern the object explicitly from its feature maps. Similarly, SPP tends to encounter difficulties in capturing the foreground, whereas our multi-scale Pooler, thanks to its receptive fields originating from tapping the information conveyed by different channels, exhibits stronger performance.

Table 2: Performance comparison with different event backbones. Method 3 is our TENet.

Methods	MobileNet	SPP	Pooler	FLOPs	FPS	VisEvent PR	VisEvent SR	COESOT PR	COESOT SR
1	✓			34.615 G	23.02 fps	75.0	58.7	75.7	67.5
2		✓		36.374 G	37.77 fps	74.1	58.1	75.3	67.1
3			✓	35.157 G	44.13 fps	76.5	60.1	76.8	68.4

4.4 Ablation Study

The influence of the synergistic effects of the Event backbone (Pooler) and Mutually-Guided Fusion (MGF). To validate the effectiveness of Pooler and MGF, ablation experiments are conducted on both datasets. Table 3 presents the results of the experiments. In the absence of Pooler, we substitute the feature extraction backbones of both modalities with OSTrack [17], and share the weights during training.

In Method 1, the absence of both our Pooler and MGF results in a noteworthy decline in performance. Specifically, VisEvent shows a decrease of 1.9% and 1.9% in precision rate (PR) and success rate (SR), while COESOT exhibits a reduction of 1.5% and 1.2% in PR and SR, respectively. Moreover, removing Pooler and MGF leads to a 43.48% increase in the computational effort of the model and a 22.11% decrease in inference speed.
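The two relative figures quoted above can be reproduced from the MACs and FPS values reported for Method 1 and Method 4 in Table 3. The helper name `relative_change` is only illustrative.

```python
def relative_change(value, reference):
    """Signed change of `value` relative to `reference`, in percent."""
    return (value - reference) / reference * 100.0


# Table 3: Method 1 (no Pooler, no MGF) versus Method 4 (full TENet).
macs_method1, macs_method4 = 50.444, 35.157   # in G
fps_method1, fps_method4 = 34.37, 44.13       # in fps

macs_increase = relative_change(macs_method1, macs_method4)   # about +43.5 %
fps_decrease = -relative_change(fps_method1, fps_method4)     # about 22.1 %
```

Both percentages are taken relative to the full model (Method 4), which is how the increase in computation and the decrease in speed are expressed in the text.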

Method 2, which retains our Pooler but discards MGF, results in a performance boost over Method 1. The model exhibits a significantly reduced computational effort, accompanied by a notable increase in inference speed. Pooler is designed to be sensitive to the sparsity of event images and has, therefore, the capacity to recognise that certain pixels within an event image hold crucial information. Consequently, its features are considerably more informative than the event features obtained by the RGB backbone.

Method 3 retains MGF but removes Pooler, producing results which are better than those of Method 1. The Mutually-Guided Fusion (MGF) module models the dynamic relationship between the modalities by means of an adaptive strategy. This is achieved by invoking cross-attention to selectively highlight features within a modality that are pertinent to the tracking task, thus enhancing the representation computed by the modality. In Method 4, the retention of both, the Pooler module and the MGF module, produces the best results. The intricate coupling of the high-quality event features extracted by the Pooler module with the RGB features, promoted by our MGF module, results in a synergistic fusion that enhances the overall quality of representation.

Table 3: Ablation study of the Pooler and MGF modules. “✓” or “✗” signifies whether this module is retained or removed in the experimental setup. “✗” in the Pooler column indicates that our Pooler is removed and that, instead, the backbone employed for the RGB modality (OSTrack [17]) is used for extracting features for the event modality. Method 4 is our TENet.

Methods	Pooler	MGF	MACs	FPS	VisEvent PR	VisEvent SR	COESOT PR	COESOT SR
1	✗	✗	50.444 G	34.37 fps	74.6	58.2	75.3	67.2
2	✓	✗	31.565 G	51.03 fps	75.0	58.8	75.7	67.4
3	✗	✓	54.036 G	28.91 fps	75.2	58.7	75.8	67.6
4	✓	✓	35.157 G	44.13 fps	76.5	60.1	76.8	68.4

The merit of MGF fusion. We validate our Mutually-Guided Fusion module by selectively removing MGF(ii) or MGF(i) in TENet and presenting the results of retraining in Table 4. In Method 3, the MGF(i) module is removed, and the MGF(ii) module is retained. There is a slight decrease in precision rate (PR) and success rate (SR) across both datasets. The appearance information captured by the RGB features is combined with the motion information conveyed by the event features, contributing to the enhancement of object motion consistency.

In the case of Method 4, where the MGF(i) module is retained and the MGF(ii) module is absent, the PR and SR on VisEvent drop by 1.9% and 1.5%, respectively, while on COESOT the PR and SR decrease by 2.3% and 0.8%, respectively. This decline affirms that the fusion of event features with RGB features combines object appearance and event motion effectively, providing a substantial performance boost to the tracker. When both the MGF(i) and MGF(ii) modules are removed (Method 2), the PR and SR drop by 1.5% and 1.3% on VisEvent and by 1.1% and 1.0% on COESOT. This decline underscores the merits of the mutual guidance and enhancement of the two modalities, emphasising their role in amalgamating the RGB appearance information and the event motion information to achieve better performance.

Table 4: Ablation study of the Mutually-Guided Fusion module. “MGF(ii)” denotes the use of event tokens as queries, while RGB tokens serve as both keys and values in the attention mechanism. “MGF(i)” denotes the use of RGB tokens as queries, while event tokens serve as both keys and values.

Methods	MGF(ii)		MGF(i)		VisEvent		COESOT
	w	w/o	w	w/o	PR	SR	PR	SR
1	✓		✓		76.5	60.1	76.8	68.4
2		✓		✓	75.0	58.8	75.7	67.4
3	✓			✓	76.1	59.5	76.5	68.2
4		✓	✓		74.6	58.6	74.5	67.6
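
The query, key and value roles described in the caption of Table 4 can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it assumes single-head scaled dot-product attention without learned projections, and the residual addition in `mutual_guidance` is an assumption made here for illustration. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    `queries` come from one modality; `keys_values` are the tokens of the
    other modality, used as both keys and values (no learned projections).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values))
             for d in range(dim)]
        )
    return attended


def mutual_guidance(rgb_tokens, event_tokens):
    """Apply both guidance directions and add the result to the inputs."""
    # MGF(i): RGB tokens as queries, event tokens as keys and values.
    rgb_guided = cross_attention(rgb_tokens, event_tokens)
    # MGF(ii): event tokens as queries, RGB tokens as keys and values.
    event_guided = cross_attention(event_tokens, rgb_tokens)

    def add(xs, ys):
        return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

    # Residual connection: an assumption of this sketch.
    return add(rgb_tokens, rgb_guided), add(event_tokens, event_guided)


if __name__ == "__main__":
    rgb = [[1.0, 0.0], [0.0, 1.0]]
    event = [[0.5, 0.5]]
    print(mutual_guidance(rgb, event))
```

Dropping one of the two `cross_attention` calls corresponds to the single-direction settings of Methods 3 and 4 in Table 4.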

Table 5: An ablation study of the Relation Modelling block. “RM block” denotes the Relation Modelling block.

	RM block		VisEvent		COESOT
Methods	w/o	w	PR	SR	PR	SR
1	✓		74.1	58.1	73.7	65.1
2		✓	76.5	60.1	76.8	68.4
Drops			-2.4	-2.0	-3.1	-3.3

The effect of downsampling on the Mutually-Guided Fusion module. To reduce the computational complexity of the model and increase the inference speed, the keys and values in both MGF(ii) and MGF(i) are downsampled along their width and height. As shown in Table 6, Method 1, without downsampling, achieves a PR of 75.8% and an SR of 59.3% on VisEvent, and a PR of 76.1% and an SR of 67.7% on COESOT. Method 2, with a downsampling rate of 4, improves the PR by 0.7% on both datasets and the SR by 0.8% on VisEvent and 0.7% on COESOT, compared to Method 1. Additionally, Method 2 reduces the computational cost by 0.567 G and speeds up inference by 4.31 fps, compared to Method 1. Compared to Method 3, which downsamples by a factor of 17, Method 2 delivers better performance and faster inference on both datasets. Together, these three experiments show that downsampling the width and the height of the keys and values with a rate of 4 accelerates inference while also improving tracking performance.

Table 6: An ablation study of the effect of the Downsampling rates in the Mutually-Guided Fusion module. “K” represents the downsampling rates.

Models	Macs	FPS	MGF(ii), K			MGF(i), K			VisEvent		COESOT
			w/o	4	17	w/o	4	17	PR	SR	PR	SR
1	35.724 G	39.82 fps	✓			✓			75.8	59.3	76.1	67.7
2	35.157 G	44.13 fps		✓			✓		76.5	60.1	76.8	68.4
3	35.017 G	40.43 fps			✓			✓	75.0	58.4	76.3	68.0
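
The saving from downsampling the keys and values can be sketched as follows. The source does not state which operator performs the downsampling, so average pooling is an assumption here, as are the function names and the 16 × 16 token grid used in the example. With a rate K, the number of query-key scores shrinks by roughly a factor of K² because only the key/value grid is reduced, not the queries.

```python
def downsample_tokens(grid, rate):
    """Average-pool an H x W grid of token vectors with a rate x rate window.

    Average pooling is an assumption of this sketch; border windows are
    simply truncated when H or W is not divisible by `rate`.
    """
    height, width, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for top in range(0, height, rate):
        row = []
        for left in range(0, width, rate):
            window = [
                grid[y][x]
                for y in range(top, min(top + rate, height))
                for x in range(left, min(left + rate, width))
            ]
            row.append(
                [sum(tok[d] for tok in window) / len(window)
                 for d in range(dim)]
            )
        pooled.append(row)
    return pooled


def attention_score_count(num_queries, kv_height, kv_width, rate=1):
    """Number of query-key scores when keys/values are downsampled by `rate`."""
    pooled_h = -(-kv_height // rate)  # ceiling division
    pooled_w = -(-kv_width // rate)
    return num_queries * pooled_h * pooled_w


if __name__ == "__main__":
    # Hypothetical sizes: 256 query tokens attending to a 16 x 16 key/value grid.
    full = attention_score_count(256, 16, 16)
    reduced = attention_score_count(256, 16, 16, rate=4)
    print(full, reduced, full // reduced)
```

The count above covers only the attention scores; the small overall reduction in Macs reported in Table 6 reflects the fact that the rest of the network is unchanged.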

The impact of modelling the relation between the fused template and the search region. We validate the effectiveness of the Relation Modelling Block by removing it. The results obtained are presented in Table 5. From the comparison between Method 1 and Method 2, it can be seen that the absence of the Relation Modelling Block causes a significant decrease in the performance of the model. These results indicate that the Relation Modelling Block plays a significant role in integrating the object information conveyed by the fused template into the fused search region.

Visualisation. To validate the effectiveness of our fusion module, we visualise several representative feature maps and score maps. The fast-moving object in Figure 8(a) appears scattered in the RGB feature map and is inaccurately localised in the event feature map. After their mutual enhancement, both modalities succeed in locating the object accurately. In the over- and under-exposed scenes of Figure 8(b) and (c), the objects are almost invisible in the RGB images, but are relatively distinct in the event images. Guided by the event features, the RGB branch recovers these otherwise invisible objects. As a result, our MGF promotes more accurate and robust object localisation.

Finally, we substitute the Pooler module in the event image modality by a backbone network homologous to the RGB branch, and visualise the corresponding feature maps and score maps. The second column of Figure 9 shows the features of both modalities extracted using the same kind of backbone, specifically OSTrack [17]. The features of the two modalities extracted using heterogeneous backbones are shown in the third column. Specifically, the event features are extracted by our Pooler, while the RGB features are extracted by OSTrack [17]. The event features extracted by the RGB backbone exhibit limited efficacy in distinguishing the object region. In contrast, the features derived from the Pooler are distinctly clear and effectively accentuate the object area. In spite of their mutual enhancement by our MGF module, the highlighted portions fail to achieve seamless alignment with the objects in the homogeneous score maps. In contrast, in the heterogeneous score maps, the highlighted part locates the object precisely.

5 Conclusion

In this paper, we propose an end-to-end RGB-E single object tracking network. Our method is composed of two key components: an Event backbone (Pooler) performing Multi-Scale Pooling and a Mutually-Guided Fusion (MGF) module. The Pooler excels in event feature extraction by leveraging the intrinsic characteristics of the event modality. The proposed MGF module capitalises on the synergies between the modalities by enriching one with insights from the other. The ablation studies demonstrate the effectiveness of both the Pooler and the MGF module. Our approach surpasses the state-of-the-art performance on both the VisEvent and COESOT datasets. The proposed TENet is the first work to take into consideration the sparseness of the event modality, as well as the real-time tracking requirements.

Acknowledgements

This work is supported in part by the National Key Research and Development Program of China (2023YFF1105102, 2023YFF1105105), the National Natural Science Foundation of China (Grant No. 62020106012, 62332008, 62106089, U1836218, 62336004), the 111 Project of Ministry of Education of China (Grant No. B12018), and the UK EPSRC (EP/N007743/1, MURI/EPSRC/DSTL, EP/R018456/1).

References

[1] J. Wen, H. Chu, Z. Lai, T. Xu, L. Shen, Enhanced robust spatial feature selection and correlation filter learning for uav tracking, Neural Networks 161 (2023) 39–54.
[2] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, et al., Event-based vision: A survey, T-PAMI 44 (1) (2020) 154–180.
[3] X. Zheng, Y. Liu, Y. Lu, T. Hua, T. Pan, W. Zhang, D. Tao, L. Wang, Deep learning for event-based vision: A comprehensive survey and benchmarks, arXiv preprint arXiv:2302.08890.
[4] F. Tang, B. Niu, G. Zong, X. Zhao, N. Xu, Periodic event-triggered adaptive tracking control design for nonlinear discrete-time systems via reinforcement learning, Neural Networks 154 (2022) 43–55.
[5] J. Zhu, S. Lai, X. Chen, D. Wang, H. Lu, Visual prompt multi-modal tracking, in: CVPR, 2023.
[6] H. Bahng, A. Jahanian, S. Sankaranarayanan, P. Isola, Exploring visual prompts for adapting large-scale models, arXiv preprint arXiv:2203.17274.
[7] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, S.-N. Lim, Visual prompt tuning, in: ECCV, Springer, 2022, pp. 709–727.
[8] J. Yang, Z. Li, F. Zheng, A. Leonardis, J. Song, Prompting for multi-modal tracking, in: ACM MM, 2022, pp. 3492–3500.
[9] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr, Fully-convolutional siamese networks for object tracking, in: ECCV Workshops, 2016.
[10] T. Xu, Z.-H. Feng, X.-J. Wu, J. Kittler, Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking, T-IP 28 (11) (2019) 5596–5609.
[11] J. Schmidhuber, Deep learning in neural networks: An overview, Neural networks 61 (2015) 85–117.
[12] M. Paul, M. Danelljan, C. Mayer, L. Van Gool, Robust visual tracking by segmentation, in: ECCV, Springer, 2022, pp. 571–588.
[13] T. Xu, Z. Feng, X.-J. Wu, J. Kittler, Adaptive channel selection for robust visual object tracking with discriminative correlation filters, IJCV 129 (2021) 1359–1375.
[14] B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High performance visual tracking with siamese region proposal network, in: CVPR, 2018, pp. 8971–8980.
[15] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan, Siamrpn++: Evolution of siamese visual tracking with very deep networks, in: CVPR, 2019, pp. 4277–4286.
[16] T. Xu, Z. Feng, X.-J. Wu, J. Kittler, Toward robust visual object tracking with independent target-agnostic detection and effective siamese cross-task interaction, T-IP 32 (2023) 1541–1554.
[17] B. Ye, H. Chang, B. Ma, S. Shan, X. Chen, Joint feature learning and relation modeling for tracking: A one-stream framework, in: ECCV, 2022.
[18] B. Yan, H. Peng, J. Fu, D. Wang, H. Lu, Learning spatio-temporal transformer for visual tracking, in: CVPR, 2021, pp. 10448–10457.
[19] F. Xie, C. Wang, G. Wang, Y. Cao, W. Yang, W. Zeng, Correlation-aware deep tracking, in: CVPR, 2022, pp. 8751–8760.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, NIPS 30 (2017).
[21] Z. Tang, T. Xu, H. Li, X.-J. Wu, X. Zhu, J. Kittler, Exploring fusion strategies for accurate rgbt visual object tracking, Information Fusion (2023) 101881.
[22] Z. Tang, T. Xu, X. Zhu, X.-J. Wu, J. Kittler, Generative-based fusion mechanism for multi-modal tracking (2023). arXiv:2309.01728.
[23] X.-F. Zhu, T. Xu, Z. Tang, Z. Wu, H. Liu, X. Yang, X.-J. Wu, J. Kittler, Rgbd1k: A large-scale dataset and benchmark for rgb-d object tracking, in: AAAI, Vol. 37, 2023, pp. 3870–3878.
[24] J. Zhang, Y. Wang, W. Liu, M. Li, J. Bai, B. Yin, X. Yang, Frame-event alignment and fusion network for high frame rate tracking, in: CVPR, 2023, pp. 9781–9790.
[25] H. Chen, D. Suter, Q. Wu, H. Wang, End-to-end learning of object motion estimation from retinal events for event-based object tracking, in: AAAI, Vol. 34, 2020, pp. 10534–10541.
[26] J. Zhang, X. Yang, Y. Fu, X. Wei, B. Yin, B. Dong, Object tracking by jointly exploiting frame and event domain, in: ICCV, 2021, pp. 13043–13052.
[27] X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y. Wang, Y. Tian, F. Wu, Visevent: Reliable object tracking via collaboration of frame and event flows, arXiv preprint arXiv:2108.05015.
[28] J. Zhang, Y. Wang, W. Liu, M. Li, J. Bai, B. Yin, X. Yang, Frame-event alignment and fusion network for high frame rate tracking, in: CVPR, 2023, pp. 9781–9790.
[29] Z. Zhu, J. Hou, D. O. Wu, Cross-modal orthogonal high-rank augmentation for rgb-event transformer-trackers, in: CVPR, 2023, pp. 22045–22055.
[30] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: CVPR, 2022, pp. 16000–16009.
[31] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929.
[32] Y. Cui, C. Jiang, L. Wang, G. Wu, Mixformer: End-to-end tracking with iterative mixed attention, in: CVPR, 2022, pp. 13608–13618.
[33] M. Danelljan, L. V. Gool, R. Timofte, Probabilistic regression for visual tracking, in: CVPR, 2020, pp. 7183–7192.
[34] G. Bhat, M. Danelljan, L. V. Gool, R. Timofte, Learning discriminative model prediction for tracking, in: ICCV, 2019, pp. 6182–6191.
[35] G. Bhat, M. Danelljan, L. V. Gool, R. Timofte, Learning discriminative model prediction for tracking, in: ICCV, 2019, pp. 6182–6191.
[36] S. Gao, C. Zhou, C. Ma, X. Wang, J. Yuan, Aiatrack: Attention in attention for transformer visual tracking, in: ECCV, Springer, 2022, pp. 146–164.
[37] C. Mayer, M. Danelljan, D. P. Paudel, L. Van Gool, Learning target candidate association to keep track of what not to track, in: CVPR, 2021, pp. 13444–13454.
[38] C. Mayer, M. Danelljan, G. Bhat, M. Paul, D. P. Paudel, F. Yu, L. Van Gool, Transforming model prediction for tracking, in: CVPR, 2022, pp. 8731–8740.
[39] N. Wang, W. Zhou, J. Wang, H. Li, Transformer meets tracker: Exploiting temporal context for robust visual tracking, in: CVPR, 2021, pp. 1571–1580.
[40] M. Danelljan, G. Bhat, Pytracking: Visual tracking library based on pytorch (2019).
[41] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, H. Lu, Transformer tracking, in: CVPR, 2021.
[42] P. Voigtlaender, J. Luiten, P. H. Torr, B. Leibe, Siam r-cnn: Visual tracking by re-detection, in: CVPR, 2020, pp. 6578–6588.
[43] C. Tang, X. Wang, J. Huang, B. Jiang, L. Zhu, J. Zhang, Y. Wang, Y. Tian, Revisiting color-event based tracking: A unified network, dataset, and metric, arXiv preprint arXiv:2211.11010.
[44] B. Koonce, Mobilenetv3, Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization (2021) 125–144.
[45] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE transactions on pattern analysis and machine intelligence 37 (9) (2015) 1904–1916.