PhyTracker: An Online Tracker for Phytoplankton

Yang Yu, Qingxuan Lv, Yuezun Li,  Zhiqiang Wei, Junyu Dong Yuezun Li and Junyu Dong are corresponding authors.Yang Yu, Qingxuan Lv, Yuezun Li, Zhiqiang Wei, and Junyu Dong are with the College of Computer Science and Technology, Ocean University of China, China. e-mail: ([email protected]; [email protected]; [email protected]; [email protected]; [email protected]).
Abstract

Phytoplankton, a crucial component of aquatic ecosystems, requires efficient monitoring to understand marine ecological processes and environmental conditions. Traditional phytoplankton monitoring methods, relying on non-in situ observations, are time-consuming and resource-intensive, limiting timely analysis. To address these limitations, we introduce PhyTracker, an intelligent in situ tracking framework designed for automatic tracking of phytoplankton. PhyTracker overcomes significant challenges unique to phytoplankton monitoring, such as constrained mobility within water flow, inconspicuous appearance, and the presence of impurities. Our method incorporates three innovative modules: a Texture-enhanced Feature Extraction (TFE) module, an Attention-enhanced Temporal Association (ATA) module, and a Flow-agnostic Movement Refinement (FMR) module. These modules enhance feature capture, differentiate between phytoplankton and impurities, and refine movement characteristics, respectively. Extensive experiments on the PMOT dataset validate the superiority of PhyTracker in phytoplankton tracking, and additional tests on the MOT dataset demonstrate its general applicability, outperforming conventional tracking methods. This work highlights key differences between phytoplankton and traditional objects, offering an effective solution for phytoplankton monitoring.

Index Terms:
Phytoplankton Observing and Analysis, Object Tracking

I Introduction

Phytoplankton is a general term for plant microorganisms, particularly referring to microalgae (see Fig. 1)  [1]. Phytoplankton is a vital component of aquatic ecosystems, with their activities serving as key indicators for marine ecological processes and environmental conditions [2]. Consequently, monitoring phytoplankton holds significant importance in maintaining the stability of aquatic ecosystems, safeguarding water resources, and advancing scientific exploration [3].

Traditional efforts on monitoring phytoplankton mainly rely on the so-called non-in situ observation approach, that is to collect water samples and bring them back to the laboratory with manual observation [4]. This approach not only consumes considerable time and human resources but also fails to analyze phytoplankton timely. To overcome this limitation, we develop an intelligent tracking framework, called PyTracker, that can be deployed on the ocean to monitor phytoplankton in a way of in situ observations. This framework is designed to automatically localize and categorize phytoplankton and then track them constantly observed in the microscope. The results can provide versatile information in monitoring phytoplankton, and can be utilized for further analysis such as density estimation, action recognition, pose estimation, etc.

Refer to caption
Figure 1: The pedestrian dataset (on the left) and the planktonic dataset (on the right) have different characteristics.

However, tracking phytoplankton is significantly more challenging compared to the scenario of tracking general objects on the ground, which mainly lies in the following three aspects:

  1. 1.

    Inconspicuous appearance: Phytoplankton commonly exhibit tiny sizes, light colors, erratic forms, and straightforward textures. These characteristics significantly increase the challenge of identification than general objects on the ground.

  2. 2.

    Complex monitor scenario: The water samples usually contain impurities dispersed throughout the entire view of observation. These impurities look very similar to phytoplankton, posing challenges to the accurate monitoring of phytoplankton.

  3. 3.

    Different monitor pipeline: Tracking objects on the ground, particularly pedestrians and vehicles, typically utilizes general video cameras and is performed under a wild scenario. However, the pipeline of our task is notably different, which involves extracting water samples from the ocean and gradually passing them through the microscopes for analysis. Under this scenario, the mobility of phytoplankton is highly constrained, and their movement within this pipeline is mainly driven by water flow, resulting in a highly uniform trajectory of these phytoplankton.

These differences greatly limit the application of conventional tracking methods for ground monitoring to phytoplankton monitoring [5, 6, 7]. As such, develo** a tracking framework devoted to this task is necessary.

In this paper, we conduct an in-depth analysis of the unique characteristics of phytoplankton and proposes a new method called PhyTracker devoted to accomplishing the tracking of phytoplankton. Our method achieves tracking in an online manner, which continuously track the phytoplankton alongside water flows with a meticulously designed architecture. Specifically, our method features three designs: 1) Since the appearance of phytoplankton is inconspicuous, we describe a Texture-enhanced Feature Extraction (TFE) module to improve the capture of appearance features, with the incorporation of dilated convolutions and SRM filters. 2) Floating impurities likely disrupt the temporal association of phytoplankton, causing the chaotic target correlation between frames before and after the tracking process. To mitigate this, we propose an Attention-enhanced Temporal Association (ATA) module to tell apart phytoplankton and impurities. The core of this module is an attention mechanism, which can effectively associate corresponding features from consecutive frames, eliminating the interference caused by impurities. 3) Trajectories are an important characteristic of phytoplankton. However, in our monitor pipeline, trajectories of phytoplankton are highly consistent, concealing the characteristic of phytoplankton movement. Therefore, we describe a Flow-agnostic Movement Refinement (FMR) module, which can recover the characteristics of each phytoplankton movement, making phytoplankton more discriminative.

Extensive experiments are conducted on a large-scale public phytoplankton tracking dataset (PMOT) [8], demonstrating the superiority of our method in tracking phytoplankton. Moreover, we validate our method on the general object tracking dataset (MOT) [9] in comparison to recent conventional tracking methods. The results surprisingly corroborate that our method can still outperform others.

The contributions of this paper can be summarized in three-fold:

  1. 1.

    We thoroughly examine the major differences between traditional objects (such as pedestrians and vehicles) and phytoplankton, highlighting three key aspects: different monitor pipelines, inconspicuous appearance, and complex monitor scenarios.

  2. 2.

    To address these challenges, we propose an online tracker with three key improvements: a Texture-enhanced Feature Extraction (TFE) module, an Attention-enhanced Temporal Association (ATA) module and a Flow-agnostic Movement Refinement (FMR) module. Each module is developed to handle corresponding challenge.

  3. 3.

    We conduct comprehensive experiments on both the phytoplankton dataset and the general object tracking dataset. The results demonstrate the effectiveness of the proposed method, particularly in the context of phytoplankton tracking.

Refer to caption
Figure 2: Overview of PhyTracker. PhyTracker receives the current frame picture and the previous frame picture at a time, as well as the trajectory information integrated from all previous frames, to assist in the tracking of the current frame.

II Backgrounds and Related Works

Phytoplankton. Phytoplankton are plant-like life forms (see Fig.1) that play an indispensable role in marine ecosystems. Through photosynthesis, they absorb carbon dioxide and release oxygen, providing a crucial source of oxygen for marine life, while also fixing carbon in organic matter [10]. When they die, some of the organic carbon settles to the ocean floor, forming sediments and participating in long-term carbon storage and the Earth’s carbon cycle [11]. Phytoplankton are also the foundation of the marine food chain. They are consumed by zooplankton and other organisms, thereby supporting the entire marine ecosystem [12]. Additionally, phytoplankton play a role in regulating the global climate by modulating the reflectivity of the ocean surface, and by absorbing and releasing heat. They also impact the chemical composition of the atmosphere, thus affecting atmospheric circulation and climate.

Monitoring Phytoplankton. Observing and real-time monitoring of phytoplankton species, density, and concentration have significant implications for humans and nature [13, 14, 15]. Firstly, changes in phytoplankton species and density can reflect the ecological health of water bodies. By monitoring phytoplankton, we can promptly detect abnormal changes in ecosystems and take appropriate measures [16]. Secondly, phytoplankton are very sensitive to environmental factors such as light, temperature, and carbon dioxide. Monitoring changes in phytoplankton can provide valuable data on climate change [17] Moreover, phytoplankton form the base of the marine food chain. Their density and distribution directly affect the reproduction and survival of other marine organisms. By monitoring phytoplankton, fishery managers can predict the density and distribution of fish resources, thereby formulating more effective fishery management strategies [13]. Additionally, certain species of phytoplankton can proliferate under specific conditions, forming harmful algal blooms (such as red tides), which lead to oxygen depletion in water bodies, release toxins, and pose threats to aquatic life and human health. Real-time monitoring of phytoplankton can provide early warnings of harmful algal blooms, reducing their negative impacts [18].

Traditional monitoring methods, referred to as non-in-situ observation, involve the collection of water samples and their observation under microscopes by trained personnel [4, 19, 20]. However, these techniques require a lot of time and human resources and lack the ability for timely phytoplankton analysis. Recently, advancements in hardware have made in-situ observation feasible by integrating digital microscopes, flow pumps, and computational chips into a single device [21]. While this device shows promising potential for phytoplankton monitoring, the algorithms specifically dedicated to this task remains unexplored. Typically, tracking is the prerequisite task for monitoring phytoplankton. Given the tracking results, we can futher analyze the activities of phytoplankton. However, existing tracking algorithms are designed for ground scenario. Compared to the ground scenario, tracking phytoplankton poses unique challenges due to the different monitor pipelines, the inconspicuous appearance of phytoplankton, and the complexity of monitoring scenario. Therefore, there is an urgent need to develop a devoted tracking framework for phytoplankton.

Object Tracking. The existing tracking methods are mainly designed for ground scenarios, focusing on tracking general objects such as vehicles and pedestrians [22, 23, 24, 25, 26]. To obtain the trajectory of objects, Object extraction, temporal association and motion prediction are three important aspects in multi-object tracking within video sequences. According to the tracking pipeline, existing methods can be divided into two categories: Offline tracking and Online tracking.

Offline Tracking. Offline tracking allows the use of information from subsequent frames and is formulated as a graph model for a globally optimal solution [27, 28, 29]. However, this setup makes it not suitable for practical applications due to its reliance on future frame data.

Online Tracking. Online multi-object tracking involves calculating the match between current object detection and existing trajectories based on the available data [30, 31, 32, 33]. The nature of online tracking requires that the decision for each frame’s tracking outcome must rely solely on information from the current and previous frames. It is imperative that the algorithm cannot use the current frame’s data to alter the results of previous frames. Therefore, this paper focuses on the setting of onlione tracking.

Refer to caption
Figure 3: Each row displays the focus points of the algorithm on a single target only. The first column is the original image, the second column is DLA34+ByteTrack, and the third column is PhyTracker.

III Motivation and Preliminary Validation

The straightforward solution for tracking phytoplankton is to directly adapt existing tracking methods into this task. However, these methods are designed for ground scenarios and focus on tackling common challenges such as occlusions, target matching between frames, and camera jitters. Thus they do not align with the scenario of phytoplankton in aquatic environments. Unlike conventional tracking targets such as pedestrian and vehicles, phytoplankton exhibits significant differences in data characteristics, including their unique low-contrast color features, the stark contrast between aquatic and ground environments, and their distinct movement patterns compared to other organisms. These differences significantly hinder the application of existing methods to phytoplankton monitoring. To validate this, we adopt the recent tracking methods ByteTrack [34] to our task. As shown in Fig. 3, it can be seen that the generated attention heatmaps of ByteTrack can hardly concentrate on the phytoplankton, indicating the infeasibility of directly adopting existing tracking methods.

IV Method

This paper describes an online tracker, PhyTracker, devoted to monitoring phytoplankton. Compared to existing tracking methods, our method features three major improvements. First, we introduce a Texture-enhanced Feature Extraction (TFE) module to enhance the appearance distinction of phytoplankton. This improvement enables the phytoplankton becoming more detectable. Second, we propose an Attention-enhanced Temporal Association (ATA) module, which optimizes the feature distances between targets in adjacent frames, enhancing the capacity of model to distinguish between similar phytoplankton, as well as impurities and phytoplankton. Furthermore, we introduce a Flow-agnostic Movement Refinement (FMR) module, which effectively reduces feature confusion from similar motion trajectories between different tracking entities and preserves original movement offset information, thereby enhancing sensitivity to individual movement characteristics. These three improvements correspond to solving the three difficulties in phytoplankton monitoring, as described in Sec.1.

IV-A Problem Setup

Denote a video sequence with total N𝑁Nitalic_N frames as 𝒱={t}t=1N𝒱superscriptsubscriptsubscript𝑡𝑡1𝑁\mathcal{V}=\{\mathcal{I}_{t}\}_{t=1}^{N}caligraphic_V = { caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, where th×w×3subscript𝑡superscript𝑤3\mathcal{I}_{t}\in\mathbb{R}^{h\times w\times 3}caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_h × italic_w × 3 end_POSTSUPERSCRIPT refers to the t𝑡titalic_t-th frame. Suppose this video sequence contains 𝒟𝒟\mathcal{D}caligraphic_D phytoplankton. The goal of our method is to output the trajectories of all phytoplankton {𝒯i}i=1𝒟superscriptsubscriptsubscript𝒯𝑖𝑖1𝒟\{\mathcal{T}_{i}\}_{i=1}^{\mathcal{D}}{ caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT within the video sequence 𝒱𝒱\mathcal{V}caligraphic_V. The i𝑖iitalic_i-th trajectory is a collection of bounding boxes at corresponding frames and its class label, defined as 𝒯i={(bi,t1,,bi,tN),yi}subscript𝒯𝑖subscript𝑏𝑖subscript𝑡1subscript𝑏𝑖subscript𝑡𝑁subscript𝑦𝑖\mathcal{T}_{i}=\{(b_{i,t_{1}},...,b_{i,t_{N}}),y_{i}\}caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { ( italic_b start_POSTSUBSCRIPT italic_i , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , … , italic_b start_POSTSUBSCRIPT italic_i , italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }, where bi,tjsubscript𝑏𝑖subscript𝑡𝑗b_{i,t_{j}}italic_b start_POSTSUBSCRIPT italic_i , italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT is the bounding box of i𝑖iitalic_i-th phytoplankton at temporal index tjsubscript𝑡𝑗t_{j}italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, tNsubscript𝑡𝑁t_{N}italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT is the length of trajectory and yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents category.

IV-B Framework Workflow

The workflow of our method is inspired by TraDes [35]. Given a frame tsubscript𝑡\mathcal{I}_{t}caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, it is first passed into Texture-enhanced Feature Extraction (TFE) module to extract appearance features of phytoplankton as fth4×w4×64superscript𝑓𝑡superscript4𝑤464f^{t}\in\mathbb{R}^{\frac{h}{4}\times\frac{w}{4}\times 64}italic_f start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT divide start_ARG italic_h end_ARG start_ARG 4 end_ARG × divide start_ARG italic_w end_ARG start_ARG 4 end_ARG × 64 end_POSTSUPERSCRIPT. Then the feature of current frame ftsuperscript𝑓𝑡f^{t}italic_f start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and the one of previous frame ft1superscript𝑓𝑡1f^{t-1}italic_f start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT are sent into the Attention-enhanced Temporal Association (ATA) module, to build the correlations between these two features. Based on the correlations, ATA can predict the movement offsets of phytoplankton. Then the movement offsets are forwarded into Flow-agnostic Movement Refinement (FMR) module, which eliminates the effect of water flows and fuses the knowledge from previous frames into the head network for final prediction.

Refer to caption
Figure 4: We use three consecutive SIE modules(Semantic Information Extraction module) to extract semantic feature. The core of the SIE network is the SRM filter.

IV-C Texture-enhanced Feature Extraction

Phytoplankton often exhibit appearance similar to their natural aquatic environments, making it difficult for traditional methods to capture discriminative features. In light of this, we propose a Texture-enhanced Feature Extraction (TFE) module to enhance feature extraction of phytoplankton (see Fig. 4). Inspired by [36, 37], this module extracts the noise information combined with semantic information to enrich the representation of phytoplankton features. Specifically, to amplify the subtle textures, we employ dilated convolutions [38] to expand the receptive field without increasing computational load. Then several SIE blocks are proposed to refine the features. SIE block is composed of convolution layers and SRM layers. SRM filters were originally designed to address the issues of image denoising and edge preservation in image processing. Under the microscope, there are situations where the flow image is unclear and there are many impurities. In response to this, we have modified the SRM filter to match the effectiveness of extracting additional information from phytoplankton data. The SRM layer is three 5×\times×5 convolutional kernels with fixed values that remain unchanged, the kernels are:

[0000001210024200121000000][12221268622812822686212221][1111110001108011000111111]matrix0000001210024200121000000matrix12221268622812822686212221matrix1111110001108011000111111\displaystyle\begin{bmatrix}0&0&0&0&0\\ 0&-1&2&-1&0\\ 0&2&-4&2&0\\ 0&-1&2&-1&0\\ 0&0&0&0&0\end{bmatrix}\begin{bmatrix}-1&2&-2&2&-1\\ 2&-6&8&-6&2\\ -2&8&-12&8&-2\\ 2&-6&8&-6&2\\ -1&2&-2&2&-1\end{bmatrix}\begin{bmatrix}-1&-1&-1&-1&-1\\ -1&0&0&0&-1\\ -1&0&8&0&-1\\ -1&0&0&0&-1\\ -1&-1&-1&-1&-1\end{bmatrix}[ start_ARG start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 2 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 2 end_CELL start_CELL - 4 end_CELL start_CELL 2 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 2 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL - 1 end_CELL start_CELL 2 end_CELL start_CELL - 2 end_CELL start_CELL 2 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 2 end_CELL start_CELL - 6 end_CELL start_CELL 8 end_CELL start_CELL - 6 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL 8 end_CELL start_CELL - 12 end_CELL start_CELL 8 end_CELL start_CELL - 2 end_CELL end_ROW start_ROW start_CELL 2 end_CELL start_CELL - 6 end_CELL start_CELL 8 end_CELL start_CELL - 6 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 2 end_CELL start_CELL - 2 end_CELL start_CELL 2 end_CELL start_CELL - 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 8 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL end_ROW end_ARG ] (1)

The features from SIE blocks are then integrated with the original images, which are sent into a DLA34 network [39]. This module can reveal differences in textures between phytoplankton and the aquatic environment, which are not easily observable in the RGB space.

Refer to caption
Figure 5: We take the basic features of three consecutive frames as input and output the similarity information between frames.

IV-D Attention-enhanced Temporal Association

Refer to caption
Figure 6: Each line in the graph represents the similarity between points.
Refer to caption
Figure 7: The figure shows the calculation process of generating offsets from the Similarity Map, and the generated offsets are stored in the 𝒪tsubscript𝒪𝑡\mathcal{O}_{t}caligraphic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in FMR.

Effectively associating the features of phytoplankton across frames is crucial for accurate tracking results. However, temporal association in our task is challenging, due to the widespread impurities in observing scenario and similar-appearance phytoplankton of different types.

To address this issue, we propose an Attention-enhanced Temporal Association (ATA) module, which is designed to effectively find the feature association of same target across consecutive frames. The core of this module is a newly proposed Two-Stage Cross-Attention (TSCA) operations based on existing self-attention mechanism [40, 41]. As shown in Fig. 5, the first stage is a feature refinement block, which enhances the features of current frame ftsuperscript𝑓𝑡f^{t}italic_f start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and previous frame ft1superscript𝑓𝑡1f^{t-1}italic_f start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT. Then the enhanced features are executed a cross-attention operation to model the temporal associations.

Specifically, in the first stage, the feature ft1superscript𝑓𝑡1f^{t-1}italic_f start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT is sent into the CP and CB modules, and then added with ft1superscript𝑓𝑡1f^{t-1}italic_f start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT for refinement. Different from ft1superscript𝑓𝑡1f^{t-1}italic_f start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT, the feature ftsuperscript𝑓𝑡f^{t}italic_f start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT is separately processed by Conv and CP modules, and calculates attention by matrix multiplication. The multiplication result is then sent into CB block, and perform residual connection with ftsuperscript𝑓𝑡f^{t}italic_f start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT for refinement. In the second stage, the refined features are performed similar operations. But the difference is that we performance cross-attention operations to obtain the intermediate feature ωtsuperscript𝜔𝑡\omega^{t}italic_ω start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. Let Q𝑄Qitalic_Q represent query features obtained from ft1superscript𝑓𝑡1f^{t-1}italic_f start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT, and K,V𝐾𝑉K,Vitalic_K , italic_V represent key and value features obtained from ftsuperscript𝑓𝑡f^{t}italic_f start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. Inspired by [41], the cross-attention operation can be defined as

CA(Q,K,V)=ϕq(Q)(ϕk(K)TV),CA𝑄𝐾𝑉subscriptitalic-ϕ𝑞𝑄subscriptitalic-ϕ𝑘superscript𝐾𝑇𝑉\textrm{CA}(Q,K,V)={\phi_{q}}(Q)({\phi_{k}}(K)^{T}V),CA ( italic_Q , italic_K , italic_V ) = italic_ϕ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( italic_Q ) ( italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_K ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V ) , (2)

where ϕqsubscriptitalic-ϕ𝑞\phi_{q}italic_ϕ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT and ϕksubscriptitalic-ϕ𝑘\phi_{k}italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT are the normalization functions for query and key features, implementing by normalization methods:

ϕq(Q)=softmaxrow(Q)subscriptitalic-ϕ𝑞𝑄subscriptsoftmax𝑟𝑜𝑤𝑄\displaystyle{\phi_{q}}(Q)={\textrm{softmax}_{row}}(Q)italic_ϕ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( italic_Q ) = softmax start_POSTSUBSCRIPT italic_r italic_o italic_w end_POSTSUBSCRIPT ( italic_Q ) (3)
ϕk(K)=softmaxcol(K),subscriptitalic-ϕ𝑘𝐾subscriptsoftmax𝑐𝑜𝑙𝐾\displaystyle{\phi_{k}}(K)={\textrm{softmax}_{col}}(K),italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_K ) = softmax start_POSTSUBSCRIPT italic_c italic_o italic_l end_POSTSUBSCRIPT ( italic_K ) ,

note that softmaxrowsubscriptsoftmax𝑟𝑜𝑤\textrm{softmax}_{row}softmax start_POSTSUBSCRIPT italic_r italic_o italic_w end_POSTSUBSCRIPT, softmaxcolsubscriptsoftmax𝑐𝑜𝑙\textrm{softmax}_{col}softmax start_POSTSUBSCRIPT italic_c italic_o italic_l end_POSTSUBSCRIPT denotes the application of the softmax function along each row or column of the corresponding input.

It is important to note that the feature ωtsuperscript𝜔𝑡\omega^{t}italic_ω start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT represents a feature that enhances the association of the same tracking target in the current frame feature ftsuperscript𝑓𝑡f^{t}italic_f start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and the past frame feature ft1superscript𝑓𝑡1f^{t-1}italic_f start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT.

Inspired by [35], we then seek the similarity between adjacent features ωt1superscript𝜔𝑡1\omega^{t-1}italic_ω start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT and ωtsuperscript𝜔𝑡\omega^{t}italic_ω start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT to obtain higher level associations. These two features are sent into a convolution block σ𝜎\sigmaitalic_σ and then perform matrix multiplication with each other to create a four-dimensional similarity matrix 𝒮h×w×h×w𝒮superscriptsuperscriptsuperscript𝑤superscriptsuperscript𝑤\mathcal{S}\in\mathbb{R}^{h^{\prime}\times w^{\prime}\times h^{\prime}\times w% ^{\prime}}caligraphic_S ∈ blackboard_R start_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT, where h=h8,w=w8formulae-sequencesuperscript8superscript𝑤𝑤8h^{\prime}=\frac{h}{8},w^{\prime}=\frac{w}{8}italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = divide start_ARG italic_h end_ARG start_ARG 8 end_ARG , italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = divide start_ARG italic_w end_ARG start_ARG 8 end_ARG. The element 𝒮(i,j,k,l)𝒮𝑖𝑗𝑘𝑙\mathcal{S}(i,j,k,l)caligraphic_S ( italic_i , italic_j , italic_k , italic_l ) denotes the similarity between the location of (i,j)𝑖𝑗(i,j)( italic_i , italic_j ) in the current frame t𝑡titalic_t and the location of (k,l)𝑘𝑙(k,l)( italic_k , italic_l ) in the previous frame t1𝑡1t-1italic_t - 1 (see Fig. 6).

Based on this similarity matrix 𝒮𝒮\mathcal{S}caligraphic_S, we can generate the offset information following [35]. As shown in Fig. 7, for a phytoplankton centered at the location of (i,j)𝑖𝑗(i,j)( italic_i , italic_j ) in the current frame t𝑡titalic_t, we can fetch its two-dimensional similarity map 𝒞i,jh×wsubscript𝒞𝑖𝑗superscriptsuperscriptsuperscript𝑤\mathcal{C}_{i,j}\in\mathbb{R}^{h^{\prime}\times w^{\prime}}caligraphic_C start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT from matrix 𝒮𝒮\mathcal{S}caligraphic_S. This similarity map 𝒞i,jsubscript𝒞𝑖𝑗\mathcal{C}_{i,j}caligraphic_C start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT stores the similarities among the phytoplankton and all locations in the previous frame t1𝑡1t-1italic_t - 1. To calculate the offset, 𝒞i,jsubscript𝒞𝑖𝑗\mathcal{C}_{i,j}caligraphic_C start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT is first max pooled by 1×w1superscript𝑤1\times w^{\prime}1 × italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and h×1superscript1h^{\prime}\times 1italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × 1 kernels and then normalized by a softmax function to obtain two vectors, 𝒞i,j𝒳[0,1]1×wsubscriptsuperscript𝒞𝒳𝑖𝑗superscript011superscript𝑤\mathcal{C}^{\mathcal{X}}_{i,j}\in[0,1]^{1\times w^{\prime}}caligraphic_C start_POSTSUPERSCRIPT caligraphic_X end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∈ [ 0 , 1 ] start_POSTSUPERSCRIPT 1 × italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT and 𝒞i,j𝒴[0,1]h×1subscriptsuperscript𝒞𝒴𝑖𝑗superscript01superscript1\mathcal{C}^{\mathcal{Y}}_{i,j}\in[0,1]^{h^{\prime}\times 1}caligraphic_C start_POSTSUPERSCRIPT caligraphic_Y end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∈ [ 0 , 1 ] start_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × 1 end_POSTSUPERSCRIPT, respectively. These two vectors represent the likelihood of this phytoplankton appearing on horizontal and vertical locations in frame t𝑡titalic_t. Then we create two offset templates 𝒯i,j𝒳1×w,𝒯i,j𝒴h×1formulae-sequencesubscriptsuperscript𝒯𝒳𝑖𝑗superscript1superscript𝑤subscriptsuperscript𝒯𝒴𝑖𝑗superscriptsuperscript1\mathcal{T}^{\mathcal{X}}_{i,j}\in\mathbb{R}^{1\times w^{\prime}},\mathcal{T}^% {\mathcal{Y}}_{i,j}\in\mathbb{R}^{h^{\prime}\times 1}caligraphic_T start_POSTSUPERSCRIPT caligraphic_X end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 1 × italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , caligraphic_T start_POSTSUPERSCRIPT caligraphic_Y end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × 1 end_POSTSUPERSCRIPT in the horizontal and vertical directions, respectively, which are calculated by

𝒯i,j𝒳(l)subscriptsuperscript𝒯𝒳𝑖𝑗𝑙\displaystyle\mathcal{T}^{\mathcal{X}}_{i,j}(l)caligraphic_T start_POSTSUPERSCRIPT caligraphic_X end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ( italic_l ) =(lj)×s,absent𝑙𝑗𝑠\displaystyle=(l-j)\times s,= ( italic_l - italic_j ) × italic_s , 1lw1𝑙superscript𝑤\displaystyle 1\leq l\leq w^{\prime}1 ≤ italic_l ≤ italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT (4)
𝒯i,j𝒴(k)subscriptsuperscript𝒯𝒴𝑖𝑗𝑘\displaystyle\mathcal{T}^{\mathcal{Y}}_{i,j}(k)caligraphic_T start_POSTSUPERSCRIPT caligraphic_Y end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ( italic_k ) =(ki)×s,absent𝑘𝑖𝑠\displaystyle=(k-i)\times s,= ( italic_k - italic_i ) × italic_s , 1kh1𝑘superscript\displaystyle 1\leq k\leq h^{\prime}1 ≤ italic_k ≤ italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT

where s𝑠sitalic_s is the feature stride of ωstsuperscriptsubscript𝜔𝑠𝑡\omega_{s}^{t}italic_ω start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. 𝒯i,j𝒳(l)subscriptsuperscript𝒯𝒳𝑖𝑗𝑙\mathcal{T}^{\mathcal{X}}_{i,j}(l)caligraphic_T start_POSTSUPERSCRIPT caligraphic_X end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ( italic_l ) refers to the horizontal offset when the phytoplankton appears at the location of (:,l):𝑙(:,l)( : , italic_l ) in frame t1𝑡1t-1italic_t - 1. Let 𝒪t=[𝒪t𝒳,𝒪t𝒴]subscript𝒪𝑡subscriptsuperscript𝒪𝒳𝑡subscriptsuperscript𝒪𝒴𝑡\mathcal{O}_{t}=[\mathcal{O}^{\mathcal{X}}_{t},\mathcal{O}^{\mathcal{Y}}_{t}]caligraphic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ caligraphic_O start_POSTSUPERSCRIPT caligraphic_X end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , caligraphic_O start_POSTSUPERSCRIPT caligraphic_Y end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] be the offset information, containing the horizontal and vertical offsets respectively. Each offset can be inferred by the dot product between the likelihoods and actual offset values as

𝒪t𝒳=𝒞i,j𝒳𝒯i,j𝒳,𝒪t𝒴=𝒞i,j𝒴𝒯i,j𝒴.formulae-sequencesubscriptsuperscript𝒪𝒳𝑡superscriptsubscript𝒞𝑖𝑗𝒳subscriptsuperscript𝒯𝒳𝑖𝑗subscriptsuperscript𝒪𝒴𝑡superscriptsubscript𝒞𝑖𝑗𝒴subscriptsuperscript𝒯𝒴𝑖𝑗\mathcal{O}^{\mathcal{X}}_{t}=\mathcal{C}_{i,j}^{\mathcal{X}}\mathcal{T}^{% \mathcal{X}}_{i,j},\;\mathcal{O}^{\mathcal{Y}}_{t}=\mathcal{C}_{i,j}^{\mathcal% {Y}}\mathcal{T}^{\mathcal{Y}}_{i,j}.caligraphic_O start_POSTSUPERSCRIPT caligraphic_X end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = caligraphic_C start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_X end_POSTSUPERSCRIPT caligraphic_T start_POSTSUPERSCRIPT caligraphic_X end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , caligraphic_O start_POSTSUPERSCRIPT caligraphic_Y end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = caligraphic_C start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_Y end_POSTSUPERSCRIPT caligraphic_T start_POSTSUPERSCRIPT caligraphic_Y end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT . (5)

By applying this offset information, we can obtain the temporal association between the previous frame and the current frame. This offset information is then sent into the third module FMR to eliminate the impact of displacement caused by phytoplankton in the past frames.

Refer to caption
Figure 8: The interior of the dashed line represents the FMR module at frame t, while the rest represents external variables.

IV-E Flow-agnostic Movement Refinement

The motion characteristics of phytoplankton under this scenario are mainly driven by water currents, exhibiting highly consistent movement trajectory characteristics within the pipeline. This phenomenon may lead to an undue reduction of feature differences of different phytoplankton classes, degrading the tracking performance.

To this end, we propose a Flow-agnostic Movement Refinement (FMR) module (see Fig. 8). This module aims to eliminate the effect of similar motion trajectories caused by water flow and amplify the differentiation between different phytoplankton individuals at the feature level. To achieve this, we need to first estimate the movement of water flow. Since the movement of all phytoplankton is driven by water flow, we can average the movement of all phytoplankton over all frames to represent water movement. Nevertheless, our framework is running online, which means it is impossible to access the future frames. To compensate, we maintain an offset memory bank and store the offset 𝒪tsubscript𝒪𝑡\mathcal{O}_{t}caligraphic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT constantly. Then we can average the offset in the bank as λt=1t(𝒪t++𝒪0)subscript𝜆𝑡1𝑡subscript𝒪𝑡subscript𝒪0\lambda_{t}=\frac{1}{t}\sum(\mathcal{O}_{t}+...+\mathcal{O}_{0})italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_t end_ARG ∑ ( caligraphic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + … + caligraphic_O start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ). In the current frame t𝑡titalic_t, we subtract λ𝜆\lambdaitalic_λ from the offset 𝒪tsubscript𝒪𝑡\mathcal{O}_{t}caligraphic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to obtain the movement characteristics of phytoplankton, which serves as an extra feature added with the original offset 𝒪tsubscript𝒪𝑡\mathcal{O}_{t}caligraphic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to form flow-agnostic feature ΩtsubscriptΩ𝑡\Omega_{t}roman_Ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. This process can be written as Ωt=2𝒪tλtsubscriptΩ𝑡2subscript𝒪𝑡subscript𝜆𝑡\Omega_{t}=2\mathcal{O}_{t}-\lambda_{t}roman_Ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 2 caligraphic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT.

Then we propagate the features t1superscript𝑡1\mathcal{H}^{t-1}caligraphic_H start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT from the previous frame to the current refined offset ΩtsubscriptΩ𝑡\Omega_{t}roman_Ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT using a deformable convolution [42], inspired by [35]. The feature t1superscript𝑡1\mathcal{H}^{t-1}caligraphic_H start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT is calculated as

t1=ωt1𝒫t1,superscript𝑡1superscript𝜔𝑡1superscript𝒫𝑡1\mathcal{H}^{t-1}=\omega^{t-1}\circ\mathcal{P}^{t-1},caligraphic_H start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT = italic_ω start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ∘ caligraphic_P start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT , (6)

where \circ is the Hadamard product, 𝒫t1superscript𝒫𝑡1\mathcal{P}^{t-1}caligraphic_P start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT is the class-agnostic center heatmap produced from the head network (we use the same head network as CenterNet [43]).

V Experiments

V-A Experimental Settings

Datasets. Our method is validated on two public datasets, PMOT [8] and MOT [9].

  1. 1.

    The PMOT dataset is a synthetic dataset specifically designed for phytoplankton tracking tasks, encompassing a total of 21 categories. It simulates video footage of plankton observed in flowing pipes under a microscope, making it the first synthetic dataset of its kind. Tracking phytoplankton in real-world environments involves much more complex scenes compared to laboratory settings. Therefore, we expanded our dataset to simulate the presence of noise found in real-world scenarios. The phytoplankton dataset we use is derived from modifications to the PMOT2023 dataset. We integrated data collected over the years from our laboratory, selecting the most suitable portions for inclusion. Based on this, we applied transformations such as occlusions, gray processing, blurring and salt-and-pepper noise to simulate complex underwater environmental conditions. Depending on the degree of added noise, we classified the entire phytoplankton dataset into three difficulty levels: no noise for easy difficulty; a lower degree of noise for medium difficulty; and a higher degree of noise for hard difficulty. As shown in Fig. 9. This allowed us to validate the model’s generalization capabilities under more realistic conditions. The entire dataset consists of 9 original video segments and 63 noise-added video segments, with the length of the transformed videos matching that of the original videos. The composition of the dataset is shown in Table V-A. The ablation studies were evaluated using the phytoplankton dataset.

  2. 2.

    To fully demonstrate the effectiveness of our method, we conducted experiments on the MOT dataset, a widely recognized benchmark in the field of multi-object tracking. The MOT dataset is designed to present a range of complex scenarios, including busy streets, shop** malls, parks, and other environments, and is provided by the MOT Challenge organization. The MOT17 dataset includes annotations for the training set but does not provide annotations for the test set; performance metrics can only be obtained by uploading the tracking results to the MOTChallenge website for evaluation. In our experiments with the MOT17 dataset, we performed an overall algorithm evaluation using only the training set. Specifically, we divided the training set into two halves: one half was used for training, and the other half was used for testing.

TABLE I: During training, we only use the original images from the training set for training.
{tabu}

to X[c]|X[c]|X[c]|X[c]|X[c]|X[c]|X[c]|X[c]|X[c]|X[c]|X[c] \tabucline[1.5pt]- Number of Frame Images   TRAIN   VAL   TEST
\tabucline[1pt]- State   Total number   Video 1   Video 2   Video 3   Video 1   Video 2   Video 3   Video 1   Video 2   Video 3
\tabucline[1pt]- initial 36000 9000 9000 9000 1800 1800 1800 1200 1200 1200
\tabucline[1pt]- noise 252000 63000 63000 63000 12600 12600 12600 8400 8400 8400
\tabucline[1.5pt]-

Evaluation Metrics. Following previous methods [35], we employ CLEAR-MOT metrics [44] to evaluate the tracking performance. This metric consists of various indicators, including MOTA, IDF1, IDs, FP, FN. The explanation of each indicator is introduced as follows:

  1. -

    MOTA is a comprehensive evaluation metric used to measure the overall performance of a tracker across the entire dataset;

  2. -

    IDF1 measures a tracker’s ability to maintain correct identity labels throughout an entire sequence;

  3. -

    False Positives (FP) refer to instances where the tracking algorithm erroneously detects the presence of a target that does not exist in reality;

  4. -

    False Negatives (FN) refer to instances where the tracking algorithm fails to detect a target that is actually present;

  5. -

    Identity switches (IDs) occur when the tracking algorithm mistakenly changes the identifier of a target during the tracking process.

Among these indicators, MOTA and IDF1 stand out as the most important for evaluating tracking performance.

Refer to caption
Figure 9: The three rows represent easy, medium, and hard difficulty levels, with no noise added in the easy level. Each column represents a type of added noise, with no gray processing present in moderate difficulty.
TABLE II: Performance of different algorithms on the no-noise phytoplankton dataset.
{tabu}

c|ccccc \tabucline[1.5pt]- Method & IDF1\uparrow MOTA\uparrow FP\downarrow FN\downarrow IDs\downarrow
\tabucline[1.0pt]- Sort (ICIP’16)[45] 69.3 55.4 171 3829 392
DeepSort (ICIP’17)[46] 24.7 39.1 1906 4032 53
DeepMot (CVPR’20)[47] 16.9 40.8 2038 3737 52
TraDeS (CVPR’21)[35] 85.4 78.3 417 1641 81
BotSort (Arxiv’22)[48] 57.9 40.0 594 5239 70
ByteTrack (ECCV’22)[34] 84.6 71.8 588 2183 2
UAVMot (CVPR’22)[49] 83.2 69.7 601 2375 11
StrongSort (TMM’23)[50] 48.3 37.0 792 4612 794
UCMCTrack (AAAI’24)[51] 61.2 56.4 153 2467 425
BoostTrack (MVA’24)[52] 86.4 75.6 306 1746 6
TLTDMOT (CVPR’24)[53] 85.2 74.3 307 1557 46
\tabucline[1.0pt]- PhyTracker 88.4 80.8 356 1482 55
\tabucline[1.5pt]-

Implementation Details. For training, the input image size was uniformly adjusted to 860 ×\times× 640, and the learning rate was decreased at the 40th and 50th epochs. We utilized the DLA34 network [39] for feature extraction and trained on a single 4080 GPU for 60 epochs, with a batch size set to 5 and an initial learning rate of 2.5×\times×104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT. In the inference phase, the tracking confidence threshold was set to 0.4, the input image size was set to 640 ×\times× 960, and input was based on the previous two consecutive frames simultaneously. In the comparative analysis of tracking-by-detection approaches, CenterNet is consistently utilized as the detection framework, with DLA34 serving as its backbone architecture. The remaining parts follow the default settings of TraDeS [35].

To enhance generalization capabilities and prevent overfitting, we incorporated data augmentation into our training process with a certain probability for each set of data. The primary techniques used include flip**, crop**, scaling, and translating, among others. Regarding training duration, both our method and the comparison method adhere to the same number of training epochs, which is 60. When training PhyTracker, we standardized the input image size to 860 ×\times× 640 and reduced the learning rate at the 40th and 50th epochs.

We use the general objectives as in [35], which is defined as =det+mask+CVAsubscriptdetsubscriptmasksubscriptCVA\mathcal{L}=\mathcal{L}_{\mathrm{det}}+\mathcal{L}_{\mathrm{mask}}+\mathcal{L}% _{\mathrm{CVA}}caligraphic_L = caligraphic_L start_POSTSUBSCRIPT roman_det end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT roman_mask end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT roman_CVA end_POSTSUBSCRIPT, where detsubscriptdet\mathcal{L}_{\mathrm{det}}caligraphic_L start_POSTSUBSCRIPT roman_det end_POSTSUBSCRIPT is the detection loss [43], masksubscriptmask\mathcal{L}_{\mathrm{mask}}caligraphic_L start_POSTSUBSCRIPT roman_mask end_POSTSUBSCRIPT is the instance segmentation loss [54], and a ReID loss [55].

TABLE III: Gray Processing is not applied in the medium difficulty level. In the hard difficulty level of Gray Processing, we convert all images to grayscale. In the table below, if the value of the MOTA metric is less than or equal to 0, it is represented by "-".
{tabu}

c@  |c@  |c@  c@  c@  |c@  c@  c@  c@   \tabucline[1.5pt]- MOTA      EASY      MEDIUM      HARD   
     NONE      Occlusion    Blurring    Salt and Pepper      Occlusion    Gray Processing    Blurring    Salt and Pepper   
\tabucline[1.0pt]- Sort (ICIP’16)[45] 55.4 28.5 43.0 0.7 18.5 12.4 18.4 0.3
DeepSort (ICIP’17)[46] 39.1 - 24.1 - - - - -
DeepMot (CVPR’20)[47] 40.8 2.8 32.3 0.2 - - - -
TraDeS (CVPR’21)[35] 78.3 52.0 44.1 0.7 45.1 27.5 - -
BotSort (Arxiv’22)[48] 40.0 16.4 9.5 0.3 9.7 9.5 - -
ByteTrack (ECCV’22)[34] 71.8 36.2 56.8 1.1 24.4 17.1 12.1 0.5
UAVMot (CVPR’22)[49] 69.7 35.7 57.0 1.1 24.7 17.8 12.7 0.6
StrongSort (TMM’23)[50] 37.0 27.0 35.1 0.9 3.2 10.5 - 0.2
UCMCTrack (AAAI’24)[51] 56.4 11.5 31.3 0.1 8.6 15.3 - -
BoostTrack (MVA’24)[52] 75.6 44.5 58.6 1.3 36.4 28.1 26.1 0.6
TLTDMOT (CVPR’24)[53] 74.3 45.0 54.2 0.9 32.6 17.7 17.7 -
\tabucline[1.0pt]- PhyTracker 80.8 52.4 53.2 1.4 45.4 31.9 24.9 0.8
\tabucline[1.5pt]-

TABLE IV: In MOT17, PhyTracker outperforms TraDeS in IDF1, MOTA, and FN metrics.
{tabu}

c|ccccc \tabucline[1.5pt]- Method & IDF1\uparrow MOTA\uparrow FP\downarrow FN\downarrow IDs\downarrow
\tabucline[1.0pt]- TraDeS [35] 58.2 51.8 1091 2554 38
PhyTracker 60.4 53.6 1123 2465 41
\tabucline[1.5pt]-

Compared Methods. To ensure the thoroughness of our experiments, we selected several methods with differing focuses for comparative analysis: SORT [45], DeepSORT [46], StrongSORT [50], BotSORT [48], DeepMOT [47], ByteTrack [34], and UAVMOT [49]. The introduction of each method is as follows:

  1. -

    SORT (ICIP’16) [45]: SORT is a simple and real-time multi-object tracking algorithm that uses a Kalman filter and the Hungarian algorithm for data association, significantly enhancing tracking accuracy and speed.

  2. -

    DeepSORT (ICIP’17) [46]: DeepSORT is an algorithm that builds upon SORT by incorporating ReID (Re-identification) and appearance features of detection boxes. It utilizes a Matching Cascade approach to reduce the number of target ID switches.

  3. -

    DeepMOT (CVPR’20) [47]: DeepMOT introduces a deep Hungarian network that uses dual bidirectional recurrent neural networks (BRNN) [56] to convey global information within the cost matrix. Its loss function is designed based on two differentiable evaluation metrics, which optimize the network’s output allocation matrix to enhance accuracy.

  4. -

    TraDeS (CVPR’21) [35]: TraDeS is an online joint detection and tracking model that utilizes tracking cues to aid in end-to-end detection. It infers target tracking offsets from cost measures, which are used to propagate previous target features to improve current target detection and segmentation.

  5. -

    BotSORT (Arxiv’22) [48]: BotSORT integrates the advantages of motion and appearance information with camera motion compensation and a more precise Kalman filter state vector, achieving accurate tracking results.

  6. -

    ByteTrack (ECCV’22) [34]: ByteTrack employs a multi-match approach during tracking. It initially matches high-scoring detection boxes with existing tracks, then matches lower-scoring boxes with tracks that were not matched in the first round, addressing the scenario of object occlusions. Additionally, it solely uses the Kalman filter and Hungarian algorithm, eliminating the need for a ReID model.

  7. -

    UAVMOT (CVPR’22) [49]: UAVMOT is an algorithm for drone viewpoints, based on improvements over FairMOT. It incorporates three key components: an ID feature update module, an adaptive motion filter, and gradient balanced focal loss. These enhancements respectively strengthen the reID feature connections between adjacent frames, address the complex motion in drone video footage, and optimize the training of heatmaps.

  8. -

    StrongSORT (TMM’23) [50]: StrongSORT has made a series of improvements over DeepSORT, such as improving the appearance feature extractor, introducing an inertia term to smooth feature updates, utilizing a Kalman filter designed for nonlinear motion, and adding a cost matrix that incorporates motion information.

  9. -

    UCMCTrack (AAAI’24) [51]: In multi-object tracking, irregular camera motion has always been a challenge. This is because the rapid movement of the camera can cause abrupt changes in the positions of the objects in the frame, making it difficult to associate them with their past trajectories. UCMCTrack establishes a connection between the motion state of the objects and the ground in the image. It employs a mapped Mahalanobis Distance as an alternative to IoU to measure the similarity between the objects and their trajectories.

  10. -

    BoostTrack (MVA’24) [52]: Handling unreliable detections and avoiding identity switches are crucial for the success of multi-object tracking. BoostTrack incorporates a confidence score for detection tracklets and utilizes it to scale the similarity measure. To reduce ambiguities caused by using IoU, BoostTrack proposes a novel addition of Mahalanobis distance and shape similarity to enhance the overall similarity measure.

  11. -

    TLTDMOT (CVPR’24) [53]: TLTDMOT introduces a series of strategies to address the long-tail distribution problem in the field of multi-object tracking. These strategies include two data augmentation techniques, namely Static Camera Viewpoint Augmentation and Dynamic Camera Viewpoint Augmentation. Additionally, TLTDMOT incorporates a Group Softmax module for re-identification purposes.

These methods include both tracking by detection and end-to-end tracking approaches. Their improvement directions vary, with some focusing on filtering detection boxes and some on addressing camera shake issues. All these methods are trained and tested under a same configuration with our method.

Refer to caption
Figure 10: Each row in the figure represents four consecutive frames. The first row is PhyTracker, the second row is ByteTrack, and the third row is UCMCTrack. The category names from top to bottom are: sanguinea, chaetoceros curvisetus, cladocera, sanguinea, eucampia zoodiacus.

V-B Experimental Analysis

We evaluated 12 different methods on the phytoplankton dataset and the MOT17 dataset. For detailed metrics of the testing performance of the twelve methods on the phytoplankton dataset without noise, refer to Table II. MOTA and IDF1 metrics reflect the overall tracking performance, while FN and FP indicate detection performance. Among the five evaluation metrics, PhyTracker achieved the best results in IDF1, MOTA, and FN. Our ratings in FP and IDs are also better than most algorithms. The overall comparison results in the phytoplankton dataset are illustrated in Table V-A. PhyTracker achieved the best results in the dataset under noise-free conditions, as well as in scenarios with added noise, including occlusion, gray processing and salt-and-pepper noise. In the blurring data, PhyTracker performs slightly worse than BoostTrack [52]. This is because the blurring process degrades the image quality, leading to poorer feature extraction by the network. End-to-end algorithms like PhyTracker directly track using the extracted features, making them more susceptible to the effects of blurring. In contrast, tracking-by-detection algorithms like BoostTrack first generate bounding boxes and then pass them to the tracking component for further processing. During this process, bounding boxes with low confidence are not completely discarded, which reduces the impact of blurring on tracking-by-detection algorithms. In contrast to earlier methods such as DeepSort [46], our approach employs an end-to-end model, eliminating reliance on detection accuracy. Unlike recent methods like TLTDMOT [53], PhyTracker focuses more on the unique characteristics of phytoplankton data. By enhancing the feature representation and association of phytoplankton while mitigating the impact of moving features on overall recognition, PhyTracker consistently outperforms other methods.

To represent the differences between PhyTracker and other methods, we tested the data without added noise using PhyTracker, ByteTrack [34] and UCMCTrack [51], with the results shown in Fig. 10. As evident from the comparison, mainstream multi-object tracking methods do not perform well on phytoplankton data. The tracking performance of PhyTracker, ByteTrack, and UCMCTrack under occlusion conditions is shown in Fig. 11. After adding occlusion, the performance of ByteTrack and UCMCTrack decreased, but PhyTracker still performed well. Compared to these two methods, PhyTracker, which relies on the FMR module capable of associating all previous frames, is better suited for tracking plankton under more challenging conditions.

We conducted a comparison between TraDeS [35] and PhyTracker in MOT17 dataset, with the results illustrated in the provided Table IV. Training under the same conditions, PhyTracker performs better than TraDeS. This indicates that our improvements based on the characteristics of phytoplankton are still effective in pedestrian data, especially regarding the strategies for feature enhancement and strengthened association for tracking objects.

Refer to caption
Figure 11: Each row in the figure shows four consecutive frames: the first row is PhyTracker, the second row is ByteTrack, and the third row is UCMCTrack. The category names from top to bottom are: sanguinea, skeletonema, thalassionema nitzschioides, protoperidinium, tintinnid.
TABLE V: We tested the effectiveness of the three modules under different embedding scenarios, validating the efficacy of each module.
{tabu}

l|ccc|cc \tabucline[1.5pt]- Scheme & TFE ATA FMR IDF1\uparrow MOTA\uparrow
\tabucline[1.0pt]- 1 Baseline 85.4 78.3
2 \checkmark 87.2 80.2
3 \checkmark \checkmark 87.2 80.4
4 \checkmark \checkmark 87.5 80.5
5 \checkmark \checkmark 86.7 79.5
6 PhyTracker \checkmark \checkmark \checkmark 88.4 80.8
\tabucline[1.5pt]-

V-C Ablation Study

In this section, we validate the effectiveness of PhyTracker through ablation studies. All experiments were conducted on the phytoplankton dataset and its noise-augmented sections. To ensure a fair evaluation of each component’s performance, our training and testing details are consistent with those described in the Implementation Details section above. When testing a specific module, no changes were made to the remaining modules. Compared to TraDeS, PhyTracker focuses more on the feature representation, enhanced association, and trajectory influence of plankton. We implemented these aspects respectively in TFE, ATA, and FMR. To verify the effectiveness of the three modules we proposed, we conducted five sets of experiments. These involved incorporating only TFE under the same conditions, adding both TFE and ATA, adding both TFE and FMR, adding both ATA and FMR, and finally integrating all three modules. As shown in the Table V, there is a noticeable increase in scores after incorporating each module, compared to the baseline alone. With all three modules integrated, in comparison to the baseline, MOTA and IDF1 scores increased by 2.5% and 3%, respectively.

Effect of TFE. We tested the effectiveness of the TFE module details using phytoplankton data and conducted three sets of experiments for comparison:

1) Study on SRM Filters. In the TFE module, we employed three fixed SRM convolutional kernels for detecting weak edges, strong edges, and sharpening, respectively. As shown in Eq. 1. To validate the effectiveness of this combination, we compared it with another set of three fixed commonly used SRM convolutional kernels: horizontal edge detection, vertical edge detection, and sharpening. The results are shown in Table VI. Compared to commonly used SRM convolution kernels, the improved SRM convolution kernel we use has improved IDF1 and MOTA metrics by 1.9% and 1.2%, respectively. The commonly used convolution kernels for comparison are as follows:

[1242100000000000000012421][1000120002400042000210001][0010000100119110010000100]matrix1242100000000000000012421matrix1000120002400042000210001matrix0010000100119110010000100\displaystyle\small\begin{bmatrix}-1&-2&-4&-2&-1\\ 0&0&0&0&0\\ 0&0&0&0&0\\ 0&0&0&0&0\\ 1&2&4&2&1\\ \end{bmatrix}\begin{bmatrix}-1&0&0&0&1\\ -2&0&0&0&2\\ -4&0&0&0&4\\ -2&0&0&0&2\\ -1&0&0&0&1\\ \end{bmatrix}\begin{bmatrix}0&0&-1&0&0\\ 0&0&-1&0&0\\ -1&-1&9&-1&-1\\ 0&0&-1&0&0\\ 0&0&-1&0&0\\ \end{bmatrix}[ start_ARG start_ROW start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 4 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL 4 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 4 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 4 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 9 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW end_ARG ] (7)
TABLE VI: Mode1 is ours, and Mode2 is another set of convolutional kernels used for comparison.
{tabu}

c|cc|cc \tabucline[1.5pt]- PhyTracker & Mode1 Mode2 IDF1\uparrow MOTA\uparrow
\tabucline[1.0pt]- SRM \checkmark 86.5 79.6
SRM \checkmark 88.4 80.8
\tabucline[1.5pt]-

2) Study on TFE module. To validate the significant effectiveness of the TFE module on phytoplankton data, we conducted a comparison using the CBAM module, which integrates channel and spatial attention mechanisms. In this experiment, we replaced the TFE module in the model with the CBAM module, kee** all other parts unchanged. Using the TFE module compared to the CBAM module improved IDF1 and MOTA by 3.1% and 2.7%, respectively. As shown in the Table VII. Compared to other commonly used modules, the TFE module has advantages in extracting plankton features due to its unique structure and specifically designed SRM.

TABLE VII: The first line indicates the use of CBAM, and the second line indicates the use of TFE.
{tabu}

c|cc|cc \tabucline[1.5pt]- Baseline+ATA+FMR & CBAM TFE IDF1\uparrow MOTA\uparrow
\tabucline[1.0pt]- + \checkmark 85.3 78.1
+ \checkmark 88.4 80.8
\tabucline[1.5pt]-

3) Study on SIE module. In the TFE module, we utilize the SIE module to extract semantic information and enhance the feature representation of tracking targets. To validate the effectiveness of the SIE module, we removed it from the TFE module and conducted training under the same conditions. The results, as shown in the Table VIII, indicate that after removing the SIE module, the IDF1 and MOTA scores dropped by 1.4% and 0.9%, respectively. Compared to using backbone network alone for image feature extraction in TraDeS, the SIE module first employs 3×\times×3 convolutions and SRM convolutions to extract semantic information specific to phytoplankton. This enriched information is then passed into the backbone network for further feature extraction, resulting in more comprehensive corresponding features.

TABLE VIII: We removed the SIE sub-module from the TFE module while kee** all other conditions unchanged to verify the effect of SIE.
{tabu}

c|c|cc \tabucline[1.5pt]- PhyTracker & SIE IDF1\uparrow MOTA\uparrow
\tabucline[1.0pt]- TFE ×\times× 87.0 79.9
TFE \checkmark 88.4 80.8
\tabucline[1.5pt]-

Effect of ATA. In the ATA module, we refine the past frame feature ft1superscript𝑓𝑡1f^{t-1}italic_f start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT to compute query, and refine the current frame feature ftsuperscript𝑓𝑡f^{t}italic_f start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT to compute key and value. Attention computation is then performed in the second layer. To verify the effectiveness of this structure, we conducted two sets of experiments. 1) The overall structure is modified to sequentially perform attention computation twice, with the input to the second layer being the single tensor output from the first layer. 2) We exchanged the query, key, and value calculated for the past and current frames. As shown in Table IX, after altering the structure, the IDF1 and MOTA scores dropped by 1.1% and 0.4%, respectively. After swap** the variables corresponding to the two frames, the IDF1 and MOTA scores dropped by 2.0% and 0.7%, respectively. This indicates that simply applying or stacking transformer does not effectively suit the application of phytoplankton data and can cause the algorithm’s focus to shift. To eliminate this bias, we first refine the features of each frame and then apply attention calculations only to the refined features, thereby minimizing the influence of non-tracking targets as much as possible.

TABLE IX: Exp1 corresponds to 1), exp2 corresponds to 2), and exp3 is our method.
{tabu}

c|ccc|cc \tabucline[1.5pt]- PhyTracker & Exp1 Exp2 Exp3 IDF1\uparrow MOTA\uparrow
\tabucline[1.0pt]- ATA \checkmark 87.3 80.4
ATA \checkmark 86.4 80.1
ATA \checkmark 88.4 80.8
\tabucline[1.5pt]-

Effect of FMR. To verify the necessity of the dual-branch structure in the FMR module and the rationality of its computational design, we conducted two sets of experiments. 1) We removed the tracking offset branch for the current frame, retaining only the branch used to eliminate past motion characteristics. 2) We modified the matrix calculation for removing past characteristics, specifically changing the calculation of memory offset to λt=𝒪t+λt1subscript𝜆𝑡subscript𝒪𝑡subscript𝜆𝑡1\lambda_{t}={\mathcal{O}_{t}}+\lambda_{t-1}italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = caligraphic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT. Removing one branch resulted in IDF1 and MOTA scores drop** by 1.0% and 0.5%, respectively. After modifying the matrix calculation, IDF1 and MOTA scores dropped by 0.4% and 0.3%, respectively. As shown in Table X, When we remove the tracking offset of the current frame, the tracker’s performance significantly declines due to the lack of this feature. However, unreasonable feature fusion can also negatively impact the tracker. Our method has achieved optimal results overall.

TABLE X: Exp1 corresponds to 1), exp2 corresponds to 2), and exp3 is our method.
{tabu}

c|ccc|cc \tabucline[1.5pt]- PhyTracker & Exp1 Exp2 Exp3 IDF1\uparrow MOTA\uparrow
\tabucline[1.0pt]- FMR \checkmark 87.4 80.3
FMR \checkmark 88.0 80.5
FMR \checkmark 88.4 80.8
\tabucline[1.5pt]-

VI Conclusion

In this paper, we develop PyTracker, an intelligent in situ tracking framework to address the critical challenges of phytoplankton monitoring. Unlike traditional non-in situ methods, PyTracker offers an automated and timely solution for monitoring phytoplankton, thereby enhancing the efficiency and accuracy of ecosystem analysis. Confronting difficulties in this task, including their constrained mobility, inconspicuous appearance, and the presence of impurities in water samples, our method proposes three novel modules: the Texture-enhanced Feature Extraction (TFE) module for improved feature capture, the Attention-enhanced Temporal Association (ATA) module for distinguishing phytoplankton from impurities, and the Flow-agnostic Movement Refinement (FMR) module for refining movement characteristics. These innovations collectively enhance the tracking performance and reliability of the system.

Through extensive experiments on the PMOT dataset, PyTracker has demonstrated superior performance in tracking phytoplankton. Moreover, our method has shown its versatility and effectiveness on the MOT dataset, surpassing conventional tracking methods. This work not only underscores the importance of tailored tracking solutions for aquatic environments but also sets a foundation for future advancements in marine ecological monitoring and scientific exploration.

VII Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant 2022ZD0117201.

References

  • [1] Tadashi Matsunaga, Haruko Takeyama, Hideki Miyashita, and Hiroko Yokouchi. Marine microalgae. Marine biotechnology I, pages 165–188, 2005.
  • [2] Yimin Chen, Changan Xu, and Seetharaman Vaidyanathan. Microalgae: a robust “green bio-bridge” between energy and environment. Critical reviews in biotechnology, 38(3):351–368, 2018.
  • [3] I Ahmad, A Yuzir, SE Mohamad, K Iwamoto, and N Abdullah. Role of microalgae in sustainable energy and environment. In IOP Conference Series: Materials Science and Engineering, volume 1051, page 012059. IOP Publishing, 2021.
  • [4] Eva Álvarez, Ángel López-Urrutia, Enrique Nogueira, and Santiago Fraga. How to effectively sample the plankton size spectrum? a case study using flowcam. Journal of Plankton Research, 33(7):1119–1133, 2011.
  • [5] Yanwu Zhang, Brian Kieft, Robert McEwen, Jordan Stanway, James Bellingham, John Ryan, Brett Hobson, Douglas Pargett, James Birch, and Christopher Scholin. Tracking and sampling of a phytoplankton patch by an autonomous underwater vehicle in drifting mode. In OCEANS 2015-MTS/IEEE Washington, pages 1–5. IEEE, 2015.
  • [6] Yanwu Zhang, Michael A Godin, Brian Kieft, Ben-Yair Raanan, John P Ryan, and Brett W Hobson. Finding and tracking a phytoplankton patch by a long-range autonomous underwater vehicle. IEEE Journal of Oceanic Engineering, 47(2):322–330, 2021.
  • [7] Ryan N Smith, Jnaneshwar Das, Yi Chao, David A Caron, Burton H Jones, and Gaurav S Sukhatme. Cooperative multi-auv tracking of phytoplankton blooms based on ocean model predictions. In OCEANS’10 IEEE SYDNEY, pages 1–10. IEEE, 2010.
  • [8] Jiaao Yu, Qingxuan Lv, Yuezun Li, Junyu Dong, Haoran Zhao, and Qiong Li. Pmot2023: A large-scale multi-object tracking (mot) dataset with application to phytoplankton observation. Journal of Marine Science and Engineering, 11(6):1141, 2023.
  • [9] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • [10] Laurence A Anderson. On the hydrogen and oxygen content of marine phytoplankton. Deep sea research part I: Oceanographic research papers, 42(9):1675–1680, 1995.
  • [11] Kevin G Sellner and Michael E Kachur. Phytoplankton: relationships between phytoplankton, nutrients, oxygen flux and secondary producers. Ecological studies in the middle reach of Chesapeake Bay, 23:11–37, 1987.
  • [12] Tom Fenchel. Marine plankton food chains. Annual Review of Ecology and Systematics, 19(1):19–38, 1988.
  • [13] Daniel G Boyce and Boris Worm. Patterns and ecological implications of historical marine phytoplankton change. Marine Ecology progress series, 534:251–272, 2015.
  • [14] CS Reynolds. Phytoplankton periodicity: the interactions of form, function and environmental variability. Freshwater biology, 14(2):111–142, 1984.
  • [15] Andrew J Irwin, Zoe V Finkel, Frank E Müller-Karger, and Luis Troccoli Ghinaglia. Phytoplankton adapt to changing ocean environments. Proceedings of the National Academy of Sciences, 112(18):5762–5766, 2015.
  • [16] P Tett, C Carreira, DK Mills, S Van Leeuwen, J Foden, E Bresnan, and RJ Gowen. Use of a phytoplankton community index to assess the health of coastal waters. ICES Journal of Marine Science, 65(8):1475–1482, 2008.
  • [17] Monika Winder and Ulrich Sommer. Phytoplankton response to a changing climate. Hydrobiologia, 698:5–16, 2012.
  • [18] Theodore J Smayda. Harmful algal blooms: their ecophysiology and general relevance to phytoplankton blooms in the sea. Limnology and oceanography, 42(5part2):1137–1153, 1997.
  • [19] Malgorzata Ciesielska, Katarzyna W Boström, and Magnus Öhlander. Observation methods. Qualitative methodologies in organization studies: Volume II: Methods and possibilities, pages 33–52, 2018.
  • [20] George J McCall. Systematic field observation. Annual review of sociology, 10(1):263–282, 1984.
  • [21] Ian Joint and Stephen B Groom. Estimation of phytoplankton production from space: current status and future potential of satellite remote sensing. Journal of experimental marine Biology and Ecology, 250(1-2):233–255, 2000.
  • [22] Zhihong Sun, Jun Chen, Liang Chao, Weijian Ruan, and Mithun Mukherjee. A survey of multiple pedestrian tracking based on tracking-by-detection framework. IEEE Transactions on Circuits and Systems for Video Technology, 31(5):1819–1833, 2021.
  • [23] Chong Sun, Fu Li, Huchuan Lu, and Gang Hua. Visual tracking via joint discriminative appearance learning. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2567–2577, 2016.
  • [24] Yabin Zhu, Chenglong Li, ** Tang, Bin Luo, and Liang Wang. Rgbt tracking by trident fusion network. IEEE Transactions on Circuits and Systems for Video Technology, 32(2):579–592, 2021.
  • [25] Rui Yao, Guosheng Lin, Chunhua Shen, Yanning Zhang, and Qinfeng Shi. Semantics-aware visual object tracking. IEEE Transactions on Circuits and Systems for Video Technology, 29(6):1687–1700, 2018.
  • [26] Chenglong Li, Chengli Zhu, Jian Zhang, Bin Luo, Xiaohao Wu, and ** Tang. Learning local-global multi-graph descriptors for rgb-t object tracking. IEEE Transactions on Circuits and Systems for Video Technology, 29(10):2913–2926, 2018.
  • [27] Michael I Jordan. Graphical models. 2004.
  • [28] E. Trucco and K. Plakas. Video tracking: A concise survey. IEEE Journal of Oceanic Engineering, 31(2):520–529, 2006.
  • [29] Li Wang, Ting Liu, Gang Wang, Kap Luk Chan, and Qingxiong Yang. Video tracking using learned hierarchical features. IEEE Transactions on Image Processing, 24(4):1424–1435, 2015.
  • [30] Weijiang Feng, Long Lan, Yong Luo, Yue Yu, Xiang Zhang, and Zhigang Luo. Near-online multi-pedestrian tracking via combining multiple consistent appearance cues. IEEE Transactions on Circuits and Systems for Video Technology, 31(4):1540–1554, 2021.
  • [31] Xiankai Lu, Chao Ma, Bingbing Ni, and Xiaokang Yang. Adaptive region proposal with channel regularization for robust object tracking. IEEE Transactions on Circuits and Systems for Video Technology, 31(4):1268–1282, 2021.
  • [32] Peng Zhang, Shujian Yu, Jiamiao Xu, Xinge You, Xiubao Jiang, Xiao-Yuan **g, and Dacheng Tao. Robust visual tracking using multi-frame multi-feature joint modeling. IEEE Transactions on Circuits and Systems for Video Technology, 29(12):3673–3686, 2019.
  • [33] Minghua Zhang, Qiuyang Zhang, Wei Song, Dongmei Huang, and Qi He. Promptvt: Prompting for efficient and accurate visual tracking. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2024.
  • [34] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, ** Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box. In European conference on computer vision, pages 1–21. Springer, 2022.
  • [35] Jialian Wu, Jiale Cao, Liangchen Song, Yu Wang, Ming Yang, and Junsong Yuan. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12352–12361, 2021.
  • [36] Yu Quan, Dong Zhang, Liyan Zhang, and **hui Tang. Centralized feature pyramid for object detection. IEEE Transactions on Image Processing, 2023.
  • [37] Khurram Azeem Hashmi, Goutham Kallempudi, Didier Stricker, and Muhammad Zeshan Afzal. Featenhancer: Enhancing hierarchical features for object detection and beyond under low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6725–6735, 2023.
  • [38] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 472–480, 2017.
  • [39] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2403–2412, 2018.
  • [40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [41] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3531–3539, 2021.
  • [42] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
  • [43] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [44] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
  • [45] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pages 3464–3468. IEEE, 2016.
  • [46] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017.
  • [47] Yihong Xu, Yutong Ban, Xavier Alameda-Pineda, and Radu Horaud. Deepmot: A differentiable framework for training multiple object trackers. arXiv preprint arXiv:1906.06618, 10(11), 2019.
  • [48] Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. Bot-sort: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651, 2022.
  • [49] Shuai Liu, Xin Li, Huchuan Lu, and You He. Multi-object tracking meets moving uav. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8876–8885, 2022.
  • [50] Yunhao Du, Zhicheng Zhao, Yang Song, Yanyun Zhao, Fei Su, Tao Gong, and Hongying Meng. Strongsort: Make deepsort great again. IEEE Transactions on Multimedia, 2023.
  • [51] Kefu Yi, Kai Luo, Xiaolei Luo, Jiangui Huang, Hao Wu, Rongdong Hu, and Wei Hao. Ucmctrack: Multi-object tracking with uniform camera motion compensation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7):6702–6710, Mar. 2024.
  • [52] Vukasin D Stanojevic and Branimir T Todorovic. Boosttrack: boosting the similarity measure and detection confidence for improved multiple object tracking. Machine Vision and Applications, 35(3), 2024.
  • [53] Sijia Chen, En Yu, **yang Li, and Wenbing Tao. Delving into the trajectory long-tail distribution for muti-object tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [54] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 282–298. Springer, 2020.
  • [55] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
  • [56] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.