PhyTracker: An Online Tracker for Phytoplankton

Yang Yu, Qingxuan Lv, Yuezun Li, Zhiqiang Wei, Junyu Dong

Yuezun Li and Junyu Dong are corresponding authors. Yang Yu, Qingxuan Lv, Yuezun Li, Zhiqiang Wei, and Junyu Dong are with the College of Computer Science and Technology, Ocean University of China, China. e-mail: ([email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Abstract

Phytoplankton, a crucial component of aquatic ecosystems, requires efficient monitoring to understand marine ecological processes and environmental conditions. Traditional phytoplankton monitoring methods, relying on non-in situ observations, are time-consuming and resource-intensive, limiting timely analysis. To address these limitations, we introduce PhyTracker, an intelligent in situ tracking framework designed for automatic tracking of phytoplankton. PhyTracker overcomes significant challenges unique to phytoplankton monitoring, such as constrained mobility within water flow, inconspicuous appearance, and the presence of impurities. Our method incorporates three innovative modules: a Texture-enhanced Feature Extraction (TFE) module, an Attention-enhanced Temporal Association (ATA) module, and a Flow-agnostic Movement Refinement (FMR) module. These modules enhance feature capture, differentiate between phytoplankton and impurities, and refine movement characteristics, respectively. Extensive experiments on the PMOT dataset validate the superiority of PhyTracker in phytoplankton tracking, and additional tests on the MOT dataset demonstrate its general applicability, outperforming conventional tracking methods. This work highlights key differences between phytoplankton and traditional objects, offering an effective solution for phytoplankton monitoring.

Index Terms:

Phytoplankton Observing and Analysis, Object Tracking

I Introduction

Phytoplankton is a general term for plant microorganisms, particularly referring to microalgae (see Fig. 1) [1]. Phytoplankton are a vital component of aquatic ecosystems, and their activities serve as key indicators of marine ecological processes and environmental conditions [2]. Consequently, monitoring phytoplankton is important for maintaining the stability of aquatic ecosystems, safeguarding water resources, and advancing scientific exploration [3].

Traditional efforts to monitor phytoplankton mainly rely on the so-called non-in situ observation approach, that is, collecting water samples and bringing them back to the laboratory for manual observation [4]. This approach not only consumes considerable time and human resources but also fails to analyze phytoplankton in a timely manner. To overcome this limitation, we develop an intelligent tracking framework, called PhyTracker, that can be deployed on the ocean to monitor phytoplankton through in situ observation. This framework is designed to automatically localize and categorize phytoplankton and then track them continuously as they are observed under the microscope. The results provide versatile information for monitoring phytoplankton and can be used for further analysis such as density estimation, action recognition, and pose estimation.

Figure 1: The pedestrian dataset (on the left) and the planktonic dataset (on the right) have different characteristics.

However, tracking phytoplankton is significantly more challenging than tracking general objects on the ground. The difficulty lies mainly in the following three aspects:

1. Inconspicuous appearance: Phytoplankton commonly exhibit tiny sizes, light colors, erratic forms, and straightforward textures. These characteristics make them significantly harder to identify than general objects on the ground.

2. Complex monitor scenario: The water samples usually contain impurities dispersed throughout the entire field of view. These impurities look very similar to phytoplankton, posing challenges to the accurate monitoring of phytoplankton.

3. Different monitor pipeline: Tracking objects on the ground, particularly pedestrians and vehicles, typically uses general video cameras and is performed in the wild. The pipeline of our task is notably different: water samples are extracted from the ocean and gradually passed through microscopes for analysis. Under this scenario, the mobility of phytoplankton is highly constrained, and their movement within the pipeline is mainly driven by water flow, resulting in highly uniform trajectories.

These differences greatly limit the application of conventional tracking methods for ground monitoring to phytoplankton monitoring [5, 6, 7]. As such, developing a tracking framework devoted to this task is necessary.

In this paper, we conduct an in-depth analysis of the unique characteristics of phytoplankton and propose a new method called PhyTracker devoted to tracking phytoplankton. Our method works in an online manner, continuously tracking phytoplankton along water flows with a meticulously designed architecture. Specifically, our method features three designs: 1) Since the appearance of phytoplankton is inconspicuous, we describe a Texture-enhanced Feature Extraction (TFE) module that improves the capture of appearance features by incorporating dilated convolutions and SRM filters. 2) Floating impurities are likely to disrupt the temporal association of phytoplankton, causing chaotic target correlation between consecutive frames during tracking. To mitigate this, we propose an Attention-enhanced Temporal Association (ATA) module to tell apart phytoplankton and impurities. The core of this module is an attention mechanism, which can effectively associate corresponding features from consecutive frames, eliminating the interference caused by impurities. 3) Trajectories are an important characteristic of phytoplankton. However, in our monitor pipeline, the trajectories of phytoplankton are highly consistent, concealing the characteristics of individual phytoplankton movement. Therefore, we describe a Flow-agnostic Movement Refinement (FMR) module, which recovers the movement characteristics of each phytoplankton, making phytoplankton more discriminative.

Extensive experiments are conducted on a large-scale public phytoplankton tracking dataset (PMOT) [8], demonstrating the superiority of our method in tracking phytoplankton. Moreover, we validate our method on the general object tracking dataset (MOT) [9] in comparison to recent conventional tracking methods. The results surprisingly corroborate that our method can still outperform others.

The contributions of this paper are threefold:

1. We thoroughly examine the major differences between traditional objects (such as pedestrians and vehicles) and phytoplankton, highlighting three key aspects: different monitor pipelines, inconspicuous appearance, and complex monitor scenarios.

2. To address these challenges, we propose an online tracker with three key improvements: a Texture-enhanced Feature Extraction (TFE) module, an Attention-enhanced Temporal Association (ATA) module, and a Flow-agnostic Movement Refinement (FMR) module. Each module is developed to handle the corresponding challenge.

3. We conduct comprehensive experiments on both the phytoplankton dataset and the general object tracking dataset. The results demonstrate the effectiveness of the proposed method, particularly in the context of phytoplankton tracking.

II Backgrounds and Related Works

Phytoplankton. Phytoplankton are plant-like life forms (see Fig.1) that play an indispensable role in marine ecosystems. Through photosynthesis, they absorb carbon dioxide and release oxygen, providing a crucial source of oxygen for marine life, while also fixing carbon in organic matter [10]. When they die, some of the organic carbon settles to the ocean floor, forming sediments and participating in long-term carbon storage and the Earth’s carbon cycle [11]. Phytoplankton are also the foundation of the marine food chain. They are consumed by zooplankton and other organisms, thereby supporting the entire marine ecosystem [12]. Additionally, phytoplankton play a role in regulating the global climate by modulating the reflectivity of the ocean surface, and by absorbing and releasing heat. They also impact the chemical composition of the atmosphere, thus affecting atmospheric circulation and climate.

Monitoring Phytoplankton. Observing and monitoring phytoplankton species, density, and concentration in real time have significant implications for humans and nature [13, 14, 15]. Firstly, changes in phytoplankton species and density can reflect the ecological health of water bodies. By monitoring phytoplankton, we can promptly detect abnormal changes in ecosystems and take appropriate measures [16]. Secondly, phytoplankton are very sensitive to environmental factors such as light, temperature, and carbon dioxide. Monitoring changes in phytoplankton can provide valuable data on climate change [17]. Moreover, phytoplankton form the base of the marine food chain. Their density and distribution directly affect the reproduction and survival of other marine organisms. By monitoring phytoplankton, fishery managers can predict the density and distribution of fish resources, thereby formulating more effective fishery management strategies [13]. Additionally, certain species of phytoplankton can proliferate under specific conditions, forming harmful algal blooms (such as red tides), which lead to oxygen depletion in water bodies, release toxins, and pose threats to aquatic life and human health. Real-time monitoring of phytoplankton can provide early warnings of harmful algal blooms, reducing their negative impacts [18].

Traditional monitoring methods, referred to as non-in-situ observation, involve the collection of water samples and their observation under microscopes by trained personnel [4, 19, 20]. However, these techniques require a lot of time and human resources and cannot analyze phytoplankton in a timely manner. Recently, advancements in hardware have made in-situ observation feasible by integrating digital microscopes, flow pumps, and computational chips into a single device [21]. While this device shows promising potential for phytoplankton monitoring, algorithms specifically dedicated to this task remain unexplored. Typically, tracking is a prerequisite for monitoring phytoplankton. Given the tracking results, we can further analyze the activities of phytoplankton. However, existing tracking algorithms are designed for ground scenarios. Compared to ground scenarios, tracking phytoplankton poses unique challenges due to the different monitor pipelines, the inconspicuous appearance of phytoplankton, and the complexity of the monitoring scenario. Therefore, there is an urgent need to develop a dedicated tracking framework for phytoplankton.

Object Tracking. Existing tracking methods are mainly designed for ground scenarios, focusing on tracking general objects such as vehicles and pedestrians [22, 23, 24, 25, 26]. To obtain the trajectories of objects, object extraction, temporal association, and motion prediction are three important aspects of multi-object tracking within video sequences. According to the tracking pipeline, existing methods can be divided into two categories: offline tracking and online tracking.

Offline Tracking. Offline tracking allows the use of information from subsequent frames and is formulated as a graph model for a globally optimal solution [27, 28, 29]. However, this setup makes it not suitable for practical applications due to its reliance on future frame data.

Online Tracking. Online multi-object tracking involves calculating the match between current object detections and existing trajectories based on the available data [30, 31, 32, 33]. The nature of online tracking requires that the decision for each frame’s tracking outcome rely solely on information from the current and previous frames. The algorithm must not use the current frame’s data to alter the results of previous frames. Therefore, this paper focuses on the setting of online tracking.

III Motivation and Preliminary Validation

The straightforward solution for tracking phytoplankton is to directly adapt existing tracking methods to this task. However, these methods are designed for ground scenarios and focus on tackling common challenges such as occlusions, target matching between frames, and camera jitter. Thus they do not align with the scenario of phytoplankton in aquatic environments. Unlike conventional tracking targets such as pedestrians and vehicles, phytoplankton exhibit significant differences in data characteristics, including their low-contrast color features, the stark contrast between aquatic and ground environments, and their distinct movement patterns compared to other organisms. These differences significantly hinder the application of existing methods to phytoplankton monitoring. To validate this, we apply the recent tracking method ByteTrack [34] to our task. As shown in Fig. 3, the attention heatmaps generated by ByteTrack hardly concentrate on the phytoplankton, indicating that directly adopting existing tracking methods is infeasible.

IV Method

This paper describes an online tracker, PhyTracker, devoted to monitoring phytoplankton. Compared to existing tracking methods, our method features three major improvements. First, we introduce a Texture-enhanced Feature Extraction (TFE) module to enhance the appearance distinction of phytoplankton. This improvement makes the phytoplankton more detectable. Second, we propose an Attention-enhanced Temporal Association (ATA) module, which optimizes the feature distances between targets in adjacent frames, enhancing the capacity of the model to distinguish between similar phytoplankton, as well as between impurities and phytoplankton. Third, we introduce a Flow-agnostic Movement Refinement (FMR) module, which effectively reduces feature confusion caused by similar motion trajectories of different tracking entities and preserves the original movement offset information, thereby enhancing sensitivity to individual movement characteristics. These three improvements correspond to the three difficulties in phytoplankton monitoring described in Sec. I.

IV-A Problem Setup

Denote a video sequence with $N$ frames in total as $\mathcal{V}=\{\mathcal{I}_{t}\}_{t=1}^{N}$, where $\mathcal{I}_{t}\in\mathbb{R}^{h\times w\times 3}$ refers to the $t$-th frame. Suppose this video sequence contains $\mathcal{D}$ phytoplankton. The goal of our method is to output the trajectories of all phytoplankton $\{\mathcal{T}_{i}\}_{i=1}^{\mathcal{D}}$ within the video sequence $\mathcal{V}$. The $i$-th trajectory is a collection of bounding boxes at the corresponding frames together with its class label, defined as $\mathcal{T}_{i}=\{(b_{i,t_{1}},\ldots,b_{i,t_{N}}),y_{i}\}$, where $b_{i,t_{j}}$ is the bounding box of the $i$-th phytoplankton at temporal index $t_{j}$, $t_{N}$ is the length of the trajectory, and $y_{i}$ represents its category.
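To make the notation concrete, a trajectory $\mathcal{T}_{i}$ can be pictured as a small container that maps frame indices to bounding boxes and carries one class label. The sketch below is only an illustration: the names `Trajectory` and `BBox` and the (x, y, width, height) box layout are assumptions, since the text does not fix a box parametrization.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical box layout (x, y, width, height); the paper does not specify one.
BBox = Tuple[float, float, float, float]


@dataclass
class Trajectory:
    """One trajectory T_i: per-frame boxes b_{i,t} plus a class label y_i."""
    label: int
    boxes: Dict[int, BBox] = field(default_factory=dict)  # frame index -> box

    def add(self, frame: int, box: BBox) -> None:
        """Record the box observed for this target at the given frame."""
        self.boxes[frame] = box

    def __len__(self) -> int:
        """Number of frames in which the target has been observed."""
        return len(self.boxes)


# An online tracker only ever appends boxes for the newest frame.
traj = Trajectory(label=3)
traj.add(1, (10.0, 12.0, 8.0, 8.0))
traj.add(2, (14.0, 12.5, 8.0, 8.0))
```

Keying the boxes by frame index lets a target be absent from some frames without breaking the structure.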

IV-B Framework Workflow

The workflow of our method is inspired by TraDes [35]. Given a frame $\mathcal{I}_{t}$, it is first passed into the Texture-enhanced Feature Extraction (TFE) module to extract the appearance features of phytoplankton as $f^{t}\in\mathbb{R}^{\frac{h}{4}\times\frac{w}{4}\times 64}$. Then the feature of the current frame $f^{t}$ and that of the previous frame $f^{t-1}$ are sent into the Attention-enhanced Temporal Association (ATA) module to build the correlations between these two features. Based on the correlations, ATA predicts the movement offsets of phytoplankton. The movement offsets are then forwarded to the Flow-agnostic Movement Refinement (FMR) module, which eliminates the effect of water flow and fuses the knowledge from previous frames into the head network for the final prediction.

IV-C Texture-enhanced Feature Extraction

Phytoplankton often exhibit an appearance similar to their natural aquatic environment, making it difficult for traditional methods to capture discriminative features. In light of this, we propose a Texture-enhanced Feature Extraction (TFE) module to enhance feature extraction for phytoplankton (see Fig. 4). Inspired by [36, 37], this module extracts noise information and combines it with semantic information to enrich the representation of phytoplankton features. Specifically, to amplify subtle textures, we employ dilated convolutions [38] to expand the receptive field without increasing the computational load. Then several SIE blocks are proposed to refine the features. An SIE block is composed of convolution layers and SRM layers. SRM filters were originally designed to address the issues of image denoising and edge preservation in image processing. Under the microscope, the flow image is sometimes unclear and contains many impurities. In response, we modify the SRM filter so that it extracts additional information from phytoplankton data more effectively. The SRM layer consists of three $5\times 5$ convolutional kernels whose values are fixed and remain unchanged; the kernels are:

$$
\begin{bmatrix}0&0&0&0&0\\ 0&-1&2&-1&0\\ 0&2&-4&2&0\\ 0&-1&2&-1&0\\ 0&0&0&0&0\end{bmatrix},\quad
\begin{bmatrix}-1&2&-2&2&-1\\ 2&-6&8&-6&2\\ -2&8&-12&8&-2\\ 2&-6&8&-6&2\\ -1&2&-2&2&-1\end{bmatrix},\quad
\begin{bmatrix}-1&-1&-1&-1&-1\\ -1&0&0&0&-1\\ -1&0&8&0&-1\\ -1&0&0&0&-1\\ -1&-1&-1&-1&-1\end{bmatrix}
\tag{1}
$$

The features from SIE blocks are then integrated with the original images, which are sent into a DLA34 network [39]. This module can reveal differences in textures between phytoplankton and the aquatic environment, which are not easily observable in the RGB space.
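As a rough illustration of the fixed-kernel SRM layer, the following NumPy sketch filters a single-channel image with the three kernels of Eq. (1), copied as printed. The function name `srm_layer`, the zero padding, the absence of any normalization constant, and the single-channel input are assumptions; how the layer is wired into the SIE block is not specified in the text.

```python
import numpy as np

# The three fixed 5x5 kernels of Eq. (1), copied as printed (never trained).
SRM_KERNELS = np.array([
    [[0, 0, 0, 0, 0],
     [0, -1, 2, -1, 0],
     [0, 2, -4, 2, 0],
     [0, -1, 2, -1, 0],
     [0, 0, 0, 0, 0]],
    [[-1, 2, -2, 2, -1],
     [2, -6, 8, -6, 2],
     [-2, 8, -12, 8, -2],
     [2, -6, 8, -6, 2],
     [-1, 2, -2, 2, -1]],
    [[-1, -1, -1, -1, -1],
     [-1, 0, 0, 0, -1],
     [-1, 0, 8, 0, -1],
     [-1, 0, 0, 0, -1],
     [-1, -1, -1, -1, -1]],
], dtype=np.float64)


def srm_layer(image):
    """Filter a 2-D image with each fixed kernel; returns a (3, h, w) array.

    Assumption: zero padding of width 2 keeps the output the same size
    as the input.
    """
    h, w = image.shape
    padded = np.pad(image, 2)  # zero padding on all four sides
    out = np.zeros((len(SRM_KERNELS), h, w))
    for c, kernel in enumerate(SRM_KERNELS):
        # Accumulate one shifted copy of the image per kernel tap.
        for dy in range(5):
            for dx in range(5):
                out[c] += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


response = srm_layer(np.ones((9, 9)))  # a flat, texture-free test image
```

On a flat image, the first two kernels respond with zero away from the border because their entries sum to zero, so only texture and edges survive. The third kernel, as printed, sums to -8 and therefore keeps a constant response.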

IV-D Attention-enhanced Temporal Association

Effectively associating the features of phytoplankton across frames is crucial for accurate tracking results. However, temporal association in our task is challenging because of the widespread impurities in the observed scene and the similar appearance of phytoplankton of different types.

To address this issue, we propose an Attention-enhanced Temporal Association (ATA) module, which is designed to effectively find the feature association of the same target across consecutive frames. The core of this module is a newly proposed Two-Stage Cross-Attention (TSCA) operation based on the existing self-attention mechanism [40, 41]. As shown in Fig. 5, the first stage is a feature refinement block, which enhances the features of the current frame $f^{t}$ and the previous frame $f^{t-1}$. A cross-attention operation is then applied to the enhanced features to model the temporal associations.

Specifically, in the first stage, the feature $f^{t-1}$ is sent into the CP and CB modules, and the result is added to $f^{t-1}$ for refinement. Unlike $f^{t-1}$, the feature $f^{t}$ is processed separately by the Conv and CP modules, and attention is computed by matrix multiplication. The multiplication result is then sent into the CB block and combined with $f^{t}$ through a residual connection for refinement. In the second stage, similar operations are applied to the refined features, except that we perform cross-attention to obtain the intermediate feature $\omega^{t}$. Let $Q$ represent the query features obtained from $f^{t-1}$, and let $K,V$ represent the key and value features obtained from $f^{t}$. Inspired by [41], the cross-attention operation can be defined as

$$
\textrm{CA}(Q,K,V)=\phi_{q}(Q)\left(\phi_{k}(K)^{T}V\right), \tag{2}
$$

where $\phi_{q}$ and $\phi_{k}$ are the normalization functions for the query and key features, implemented as

$$
\begin{aligned}
\phi_{q}(Q)&=\textrm{softmax}_{row}(Q),\\
\phi_{k}(K)&=\textrm{softmax}_{col}(K).
\end{aligned} \tag{3}
$$

Here, $\textrm{softmax}_{row}$ and $\textrm{softmax}_{col}$ denote applying the softmax function along each row and along each column of the corresponding input, respectively.
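The cross-attention of Eq. (2) and Eq. (3) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the feature maps are taken to be flattened so that each row of `q`, `k`, and `v` corresponds to one spatial location, and the sizes `n`, `d_k`, and `d_v` are hypothetical. The learned projections that produce $Q$, $K$, and $V$ are omitted.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q, k, v):
    """CA(Q, K, V) = softmax_row(Q) (softmax_col(K)^T V), following Eq. (2)."""
    phi_q = softmax(q, axis=1)   # each row of Q sums to 1
    phi_k = softmax(k, axis=0)   # each column of K sums to 1
    context = phi_k.T @ v        # (d_k, d_v) summary of keys and values
    return phi_q @ context       # (n, d_v)


rng = np.random.default_rng(0)
n, d_k, d_v = 6, 4, 5            # hypothetical sizes
q = rng.normal(size=(n, d_k))    # queries, derived from f^{t-1}
k = rng.normal(size=(n, d_k))    # keys, derived from f^t
v = rng.normal(size=(n, d_v))    # values, derived from f^t
out = cross_attention(q, k, v)
```

Multiplying $\phi_{k}(K)^{T}V$ first means the $n\times n$ attention map is never formed explicitly, so the cost grows linearly with the number of locations $n$. The implicit attention weights are still non-negative and sum to one over the key locations.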

Note that $\omega^{t}$ is a feature that strengthens the association of the same tracking target between the current-frame feature $f^{t}$ and the previous-frame feature $f^{t-1}$.

Inspired by [35], we then compute the similarity between the adjacent features $\omega^{t-1}$ and $\omega^{t}$ to obtain higher-level associations. These two features are sent into a convolution block $\sigma$ and then multiplied with each other to create a four-dimensional similarity matrix $\mathcal{S}\in\mathbb{R}^{h^{\prime}\times w^{\prime}\times h^{\prime}\times w^{\prime}}$, where $h^{\prime}=\frac{h}{8},w^{\prime}=\frac{w}{8}$. The element $\mathcal{S}(i,j,k,l)$ denotes the similarity between location $(i,j)$ in the current frame $t$ and location $(k,l)$ in the previous frame $t-1$ (see Fig. 6).
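The four-dimensional similarity matrix can be illustrated with a single `einsum` call. The sketch assumes that the convolution block $\sigma$ has already been applied, that the embeddings are stored channels-last, and that similarity is a plain dot product between embedding vectors. The sizes and the name `similarity_volume` are hypothetical.

```python
import numpy as np


def similarity_volume(emb_cur, emb_prev):
    """S[i, j, k, l] = dot(emb_cur[i, j, :], emb_prev[k, l, :])."""
    return np.einsum("ijc,klc->ijkl", emb_cur, emb_prev)


rng = np.random.default_rng(0)
h, w, c = 3, 4, 8                      # hypothetical h', w', channel count
emb_cur = rng.normal(size=(h, w, c))   # stands in for sigma(omega^t)
emb_prev = rng.normal(size=(h, w, c))  # stands in for sigma(omega^{t-1})

S = similarity_volume(emb_cur, emb_prev)
sim_map = S[1, 2]  # similarities of current location (1, 2) to every previous location
```

Indexing the first two axes with a current-frame location yields a two-dimensional map over all locations in the previous frame.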

Based on this similarity matrix $\mathcal{S}$, we can generate the offset information following [35]. As shown in Fig. 7, for a phytoplankton centered at location $(i,j)$ in the current frame $t$, we can fetch its two-dimensional similarity map $\mathcal{C}_{i,j}\in\mathbb{R}^{h^{\prime}\times w^{\prime}}$ from the matrix $\mathcal{S}$. This similarity map $\mathcal{C}_{i,j}$ stores the similarities between the phytoplankton and all locations in the previous frame $t-1$. To calculate the offset, $\mathcal{C}_{i,j}$ is first max pooled by $h^{\prime}\times 1$ and $1\times w^{\prime}$ kernels and then normalized by a softmax function to obtain two vectors, $\mathcal{C}^{\mathcal{X}}_{i,j}\in[0,1]^{1\times w^{\prime}}$ and $\mathcal{C}^{\mathcal{Y}}_{i,j}\in[0,1]^{h^{\prime}\times 1}$, respectively. These two vectors represent the likelihood of this phytoplankton appearing at each horizontal and vertical location in frame $t-1$. Then we create two offset templates $\mathcal{T}^{\mathcal{X}}_{i,j}\in\mathbb{R}^{1\times w^{\prime}},\mathcal{T}^{\mathcal{Y}}_{i,j}\in\mathbb{R}^{h^{\prime}\times 1}$ in the horizontal and vertical directions, respectively, which are calculated by

$$
\begin{aligned}
\mathcal{T}^{\mathcal{X}}_{i,j}(l)&=(l-j)\times s, & 1\leq l\leq w^{\prime},\\
\mathcal{T}^{\mathcal{Y}}_{i,j}(k)&=(k-i)\times s, & 1\leq k\leq h^{\prime},
\end{aligned} \tag{4}
$$

where $s$ is the feature stride of $\omega_{s}^{t}$. $\mathcal{T}^{\mathcal{X}}_{i,j}(l)$ refers to the horizontal offset when the phytoplankton appears at the location $(:,l)$ in frame $t-1$. Let $\mathcal{O}_{t}=[\mathcal{O}^{\mathcal{X}}_{t},\mathcal{O}^{\mathcal{Y}}_{t}]$ be the offset information, containing the horizontal and vertical offsets, respectively. Each offset can be inferred by the dot product between the likelihoods and the actual offset values as

$$
\mathcal{O}^{\mathcal{X}}_{t}=\mathcal{C}_{i,j}^{\mathcal{X}}\mathcal{T}^{\mathcal{X}}_{i,j},\;\mathcal{O}^{\mathcal{Y}}_{t}=\mathcal{C}_{i,j}^{\mathcal{Y}}\mathcal{T}^{\mathcal{Y}}_{i,j}. \tag{5}
$$
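The offset computation of Eq. (4) and Eq. (5) amounts to taking an expected displacement under the pooled likelihoods. The NumPy sketch below works on one similarity map. The function name `expected_offset`, the use of 0-based indices (which leaves the differences $l-j$ and $k-i$ unchanged), and the stride value in the example are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def expected_offset(sim_map, i, j, stride):
    """Offset of the target at current location (i, j) toward frame t-1.

    sim_map: (h', w') similarities to every location in the previous frame.
    Returns (horizontal, vertical) offsets in input-image pixels.
    """
    h, w = sim_map.shape
    c_x = softmax(sim_map.max(axis=0))   # length w': max over rows, then softmax
    c_y = softmax(sim_map.max(axis=1))   # length h': max over columns, then softmax
    t_x = (np.arange(w) - j) * stride    # horizontal template, Eq. (4)
    t_y = (np.arange(h) - i) * stride    # vertical template, Eq. (4)
    return float(c_x @ t_x), float(c_y @ t_y)   # dot products of Eq. (5)


# A sharply peaked map: the target at (4, 3) was at (2, 5) in the previous frame.
sim = np.zeros((6, 8))
sim[2, 5] = 50.0
ox, oy = expected_offset(sim, i=4, j=3, stride=8)
```

With a sharply peaked map, the softmax is close to one-hot and the offsets reduce to the displacement toward the peak scaled by the stride. Flatter maps give a soft average over candidate locations.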

By applying this offset information, we can obtain the temporal association between the previous frame and the current frame. This offset information is then sent into the third module FMR to eliminate the impact of displacement caused by phytoplankton in the past frames.

IV-E Flow-agnostic Movement Refinement

The motion characteristics of phytoplankton under this scenario are mainly driven by water currents, exhibiting highly consistent movement trajectory characteristics within the pipeline. This phenomenon may lead to an undue reduction of feature differences of different phytoplankton classes, degrading the tracking performance.

To this end, we propose a Flow-agnostic Movement Refinement (FMR) module (see Fig. 8). This module aims to eliminate the effect of similar motion trajectories caused by water flow and to amplify the differences between phytoplankton individuals at the feature level. To achieve this, we first need to estimate the movement of the water flow. Since the movement of all phytoplankton is driven by water flow, we could average the movement of all phytoplankton over all frames to represent the water movement. However, our framework runs online, which means future frames are not accessible. To compensate, we maintain an offset memory bank and continuously store the offset $\mathcal{O}_{t}$. We then average the offsets in the bank as $\lambda_{t}=\frac{1}{t}(\mathcal{O}_{t}+\ldots+\mathcal{O}_{0})$. In the current frame $t$, we subtract $\lambda_{t}$ from the offset $\mathcal{O}_{t}$ to obtain the movement characteristics of phytoplankton, which serve as an extra feature that is added to the original offset $\mathcal{O}_{t}$ to form the flow-agnostic feature $\Omega_{t}$. This process can be written as $\Omega_{t}=2\mathcal{O}_{t}-\lambda_{t}$.
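The memory-bank update can be sketched as a running mean. The example is deliberately simplified and rests on assumptions: each offset is reduced to a single (horizontal, vertical) pair rather than a dense offset map, the class name `OffsetMemoryBank` is hypothetical, and $\lambda_{t}$ is taken to be the plain mean of the stored offsets, which is one reading of the normalization in the text.

```python
from typing import Tuple

Offset = Tuple[float, float]  # (horizontal, vertical)


class OffsetMemoryBank:
    """Accumulates past offsets so the mean flow can be estimated online."""

    def __init__(self) -> None:
        self._sum_x = 0.0
        self._sum_y = 0.0
        self._count = 0

    def refine(self, offset: Offset) -> Offset:
        """Store O_t, then return Omega_t = O_t + (O_t - lambda_t)."""
        self._sum_x += offset[0]
        self._sum_y += offset[1]
        self._count += 1
        # Assumption: lambda_t is the mean over all offsets stored so far.
        lam_x = self._sum_x / self._count
        lam_y = self._sum_y / self._count
        return (2 * offset[0] - lam_x, 2 * offset[1] - lam_y)


bank = OffsetMemoryBank()
# Two frames of pure drift, then one frame where the target also moves itself.
refined = [bank.refine(o) for o in [(4.0, 0.0), (4.0, 0.0), (7.0, 1.0)]]
```

While the offsets match the accumulated mean, the deviation term is zero and $\Omega_{t}$ equals the raw offset. Once a target departs from the common drift, the deviation is added on top of the offset, which makes its individual motion stand out.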

Then we propagate the features $\mathcal{H}^{t-1}$ from the previous frame to the current frame, guided by the refined offset $\Omega_{t}$, using a deformable convolution [42], inspired by [35]. The feature $\mathcal{H}^{t-1}$ is calculated as

$$
\mathcal{H}^{t-1}=\omega^{t-1}\circ\mathcal{P}^{t-1}, \tag{6}
$$

where $\circ$ is the Hadamard product, $\mathcal{P}^{t-1}$ is the class-agnostic center heatmap produced from the head network (we use the same head network as CenterNet [43]).

V Experiments

V-A Experimental Settings

Datasets. Our method is validated on two public datasets, PMOT [8] and MOT [9].

1. The PMOT dataset is a synthetic dataset specifically designed for phytoplankton tracking tasks, encompassing a total of 21 categories. It simulates video footage of plankton observed in flowing pipes under a microscope, making it the first synthetic dataset of its kind. Tracking phytoplankton in real-world environments involves much more complex scenes than laboratory settings. Therefore, we expanded our dataset to simulate the noise found in real-world scenarios. The phytoplankton dataset we use is derived from modifications to the PMOT2023 dataset. We integrated data collected over the years in our laboratory, selecting the most suitable portions for inclusion. On this basis, we applied transformations such as occlusions, gray processing, blurring, and salt-and-pepper noise to simulate complex underwater environmental conditions. Depending on the degree of added noise, we classified the entire phytoplankton dataset into three difficulty levels, as shown in Fig. 9: no noise for easy difficulty, a lower degree of noise for medium difficulty, and a higher degree of noise for hard difficulty. This allowed us to validate the model’s generalization capabilities under more realistic conditions. The entire dataset consists of 9 original video segments and 63 noise-added video segments, with the length of the transformed videos matching that of the original videos. The composition of the dataset is shown in Table V-A. The ablation studies were evaluated using the phytoplankton dataset.

2. To fully demonstrate the effectiveness of our method, we conducted experiments on the MOT dataset, a widely recognized benchmark in the field of multi-object tracking. The MOT dataset is designed to present a range of complex scenarios, including busy streets, shopping malls, parks, and other environments, and is provided by the MOT Challenge organization. The MOT17 dataset includes annotations for the training set but does not provide annotations for the test set; performance metrics can only be obtained by uploading the tracking results to the MOTChallenge website for evaluation. In our experiments with the MOT17 dataset, we performed an overall algorithm evaluation using only the training set. Specifically, we divided the training set into two halves: one half was used for training, and the other half was used for testing.
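The noise-added PMOT variants described in the first item above can be illustrated with one of the listed transformations, salt-and-pepper noise. The sketch is an assumption-laden example rather than the procedure actually used: the function name `salt_and_pepper`, the per-level noise fractions in `LEVELS`, and the frame size are all hypothetical, since the text does not report the noise parameters.

```python
import numpy as np


def salt_and_pepper(image, amount, rng):
    """Set roughly a fraction `amount` of pixels to black (0) or white (255)."""
    noisy = image.copy()
    mask = rng.random(image.shape[:2])      # one random draw per pixel
    noisy[mask < amount / 2] = 0            # pepper
    noisy[mask > 1 - amount / 2] = 255      # salt
    return noisy


rng = np.random.default_rng(0)
frame = np.full((64, 64, 3), 128, dtype=np.uint8)   # stand-in for a video frame

# Hypothetical noise fractions for the three difficulty levels.
LEVELS = {"easy": 0.0, "medium": 0.02, "hard": 0.10}
variants = {name: salt_and_pepper(frame, amount, rng)
            for name, amount in LEVELS.items()}
```

Applying the same function frame by frame keeps each transformed video the same length as its source, consistent with the dataset description.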