FlightPatchNet: Multi-Scale Patch Network with Differential Coding for Flight Trajectory Prediction

Lan Wu, Xuebin Wang , Ruijuan Chu, Guangyi Liu, Yingchun Chen
**g Zhang , Linyu Wang
Information Engineering University
{lanwundsc,xuebinwang2024,liuguangyi1982,springer_2002}@163.com
{linyu18,zhurj18}@mails.jlu.edu.cn, [email protected]

Abstract

Accurate multi-step flight trajectory prediction plays an important role in Air Traffic Control, which can ensure the safety of air transportation. Two main issues limit the flight trajectory prediction performance of existing works. The first issue is the negative impact on prediction accuracy caused by the significant differences in data range. The second issue is that real-world flight trajectories involve underlying temporal dependencies, and existing methods fail to reveal the hidden complex temporal variations and only extract features from one single time scale. To address the above issues, we propose FlightPatchNet, a multi-scale patch network with differential coding for flight trajectory prediction. Specifically, FlightPatchNet first utilizes the differential coding to encode the original values of longitude and latitude into first-order differences and generates embeddings for all variables at each time step. Then, a global temporal attention is introduced to explore the dependencies between different time steps. To fully explore the diverse temporal patterns in flight trajectories, a multi-scale patch network is delicately designed to serve as the backbone. The multi-scale patch network exploits stacked patch mixer blocks to capture inter- and intra-patch dependencies under different time scales, and further integrates multi-scale temporal features across different scales and variables. Finally, FlightPatchNet ensembles multiple predictors to make direct multi-step prediction. Extensive experiments on ADS-B datasets demonstrate that our model outperforms the competitive baselines. Code is available at: https://github.com/FlightTrajectoryResearch/FlightPatchNet.

1 Introduction

Flight Trajectory Prediction (FTP) is an essential task in the Air Traffic Control (ATC) procedure, which can be applied to various scenarios such as air traffic flow prediction Abadi6878453 ; LIN2019105113 , aircraft conflict detection AdeepGaussian , and arrival time estimation WANG2018280 . Accurate FTP can ensure the safety of air transportation and improve real-time airspace management LIN8846596 ; Shi9136843 . Generally, FTP tasks can be divided into three categories: long-term Jeong8190764 ; Runle7999617 , medium-term Yuan7554828 ; Chen2016ShortmediumtermPF , and short-term huang2017short ; duan2018unified . Among them, short-term trajectory prediction has the greatest impact on ATC and is increasingly in demand for air transportation. In this paper, we mainly focus on the short-term FTP task, which aims to predict future flight trajectories based on historical observations.

In the ATC domain, multi-step trajectory prediction is more practical than single-step prediction LIN8846596 . It can be divided into Iterated Multi-Step (IMS) prediction and Direct Multi-Step (DMS) prediction. IMS-based methods Yan6972562 ; Zhang2023FlightTP ; Guo2023FlightBERT make multi-step predictions recursively: they learn a single-step model and iteratively feed the predicted values back as observations to forecast the next trajectory point. Due to error accumulation and the step-by-step prediction scheme, this type of method usually performs poorly in multi-step prediction and cannot meet the real-time requirement. By contrast, DMS-based methods Guo2023FlightBERT++ ; wuhan2023bi generate all future trajectory points at once, which avoids error accumulation and improves prediction efficiency. Therefore, this paper performs the short-term FTP task in a DMS manner.

However, two main issues are not well addressed in existing works Yan6972562 ; Zhang2023FlightTP ; Guo2023FlightBERT++ ; wuhan2023bi , limiting the trajectory prediction performance. The first issue is the negative impact on prediction accuracy caused by the significant differences in data range. In general, longitude and latitude are expressed in degrees, whereas altitude is expressed in meters. Since one degree is approximately 111 kilometers, the data ranges of longitude and latitude differ greatly from that of altitude. Some previous works CNN-LSTM9145522 ; LSTM8489734 directly utilized normalization algorithms to scale variables into the same range, e.g., from 0 to 1. However, the actual prediction errors may be unacceptable for the FTP task when evaluated in the raw data range. FlightBERT Guo2023FlightBERT and FlightBERT++ Guo2023FlightBERT++ proposed a binary encoding (BE) representation that converts variables from rounded decimal numbers to binary vectors and treats the FTP task as a multiple binary classification problem. Although the BE representation avoids the vulnerability caused by normalization algorithms, it introduces one serious limitation: a misclassified high-order bit in binary leads to a large absolute error in decimal.

The second issue is that real-world flight trajectories involve underlying temporal dependencies, and existing methods Shi9136843 ; Guo2023FlightBERT ; Guo2023FlightBERT++ fail to reveal the hidden complex temporal variations and extract features from only a single time scale. As shown in Figure 1, the original series of longitude and latitude are overly smooth and obscure abundant temporal variations, which can be clearly observed in the first-order difference series. Besides, the temporal variation patterns of longitude and latitude are quite distinct from those of altitude, which has an obvious global trend but suffers from intense local fluctuations. For example, slight turbulence can exert a significant influence on the altitude but produce a negligible effect on the longitude and latitude. A single-scale model cannot simultaneously capture both local temporal details and global trends. This calls for powerful multi-scale temporal modeling capacity. Furthermore, if the learned multi-scale temporal patterns are simply aggregated, the model fails to focus on the patterns that contribute most. Meanwhile, it is essential to explore relationships across variables, e.g., the velocity at the current time step directly affects the location at the next time step. Thus, scale-wise correlations and inter-variable relationships should be fully considered when modeling the multi-scale temporal patterns.

Based on the above analysis, this paper proposes a multi-scale patch network with differential coding (FlightPatchNet) to address these issues. Specifically, we utilize differential coding to encode the original values of longitude and latitude into first-order differences and retain the original values of the other variables as inputs. Because dependencies exist between both nearby and distant time steps, we introduce a global temporal embedding to explore the correlations between time steps. Then, a multi-scale patch network is proposed to enable powerful and comprehensive temporal modeling. The multi-scale patch network divides the trajectory series into patches of different sizes, and exploits stacked patch mixer blocks to capture global trends across patches and local details within patches. To further promote the multi-scale temporal modeling capacity, a multi-scale aggregator is introduced to capture scale-wise correlations and inter-variable relationships. Finally, FlightPatchNet ensembles multiple predictors to make direct multi-step forecasts, which benefits from complementary multi-scale temporal features and improves the generalization ability. The main contributions are summarized as follows:

• We utilize differential coding to effectively reduce the differences in data range and reveal the underlying temporal variations in real-world flight trajectories. Our empirical studies show that using differential values of longitude and latitude can greatly improve prediction accuracy.

• We propose FlightPatchNet to fully explore underlying multi-scale temporal patterns. A multi-scale patch network is designed to capture inter- and intra-patch dependencies under different time scales, and integrate multi-scale temporal features across scales and variables. To our knowledge, this is the first work that introduces multi-scale modeling for flight trajectory prediction.

• We conduct extensive experiments on a real-world dataset. The experimental results demonstrate that our proposed model significantly outperforms the most competitive baselines.

2 Related Work

Kinetics-and-Aerodynamics Methods

The Kinetics-and-Aerodynamics methods thipphavong2013adaptive ; soler2015multiphase ; benavides2014implementation ; tang20154d divide the entire flight process into several phases, and establish motion equations for each phase to formulate the flight status. For example, wang2009prediction adopted basic flight models to construct horizontal, vertical, and velocity profiles based on the characteristics of different flight phases. Zhi**g7867472 combined the dynamics-and-kinematics models and grayscale theory to predict future trajectories. The grayscale theory can address the parameter missing problem in dynamics-and-kinematics models and improve the prediction performance. Due to numerous unknown and time-varying flight parameters of aircraft, these fixed-parameter methods cannot accurately describe the flight status, leading to a poor performance and limited application scenarios.

State-Estimation Methods

The Kalman Filter and its variants Yan6972562 ; wang20144d ; xi2008simulation are the typical single-model state-estimation algorithms for the FTP task, which apply predefined state equations to estimate the next flight status based on the current observation. For example, xi2008simulation applied the Kalman Filter to track discrete flight trajectories by calculating a continuous state transition matrix. However, single-model algorithms cannot adapt to the complex ATC environment. To address this issue, Interacting Multiple Model algorithms hwang2003flight ; li2005survey have been proposed and successfully applied to trajectory analysis. Although multi-model algorithms can achieve better prediction performance, their computational complexity is high and they cannot satisfy the real-time requirement.

Deep Learning Methods

With the rapid development of deep learning, there has been a surge of deep learning methods for the FTP task Guo2023FlightBERT ; Guo2023FlightBERT++ ; pang2022bayesian ; xu2021multi ; Sahadevan . These learning-based approaches can extract high-dimensional features from raw data and have achieved considerably better performance than previous methods. For example, Sahadevan used a Bi-directional Long-Short-Term-Memory (Bi-LSTM) network to explore both forward and backward dependencies in the sequential trajectory data. FlightBERT Guo2023FlightBERT employed binary encoding to represent the attributes of the trajectory points and treated the FTP task as a multiple binary classification problem. However, these works predict the future trajectory recursively and suffer from serious error accumulation. Recently, FlightBERT++ Guo2023FlightBERT++ has been introduced for DMS prediction, which considers the prior horizon information and directly predicts the differential values between adjacent points.

3 Methodology

The FTP task can be formulated as a Multivariate Time Series (MTS) forecasting problem. Formally, we are given a sequence of historical observations $\mathbf{X}=\left\{\mathbf{x}_{1},...,\mathbf{x}_{L}\right\}\in\mathbb{R}^{C\times L}$ , where $C$ is the state dimension, $L$ is the look-back window size, and $\mathbf{x}_{t}\in\mathbb{R}^{C\times 1}$ denotes the flight state at time step $t$ . The task is to predict the future $T$ time steps $\hat{\mathbf{Y}}=\left\{\hat{\mathbf{x}}_{L+1},...,\hat{\mathbf{x}}_{L+T}\right\}\in\mathbb{R}^{C^{\prime}\times T}$ , where $C^{\prime}$ is the predicted state dimension. Specifically, in this work, the flight state $\mathbf{x}_{t}$ comprises longitude, latitude, altitude, and the velocities along these three dimensions, i.e., $\mathbf{x}_{t}=(Lon_{t},Lat_{t},Alt_{t},Vx_{t},Vy_{t},Vz_{t})^{\top}$ .

The overall architecture of FlightPatchNet is shown in Figure 2, which consists of Global Temporal Embedding, Multi-Scale Patch Network, and Predictors. Global Temporal Embedding first utilizes differential coding to transform the original values of longitude and latitude into first-order differences and embeds all variables of the same time step into temporal tokens. A global temporal attention is then introduced to capture the inherent dependencies between different tokens. Multi-Scale Patch Network is proposed to serve as the backbone which is composed of stacked patch mixer blocks and a multi-scale aggregator. Stacked patch mixer blocks divide trajectory series into patches of different sizes from large scale to small scale. Based on divided patches, each patch mixer block exploits a patch encoder and decoder to capture inter- and intra-patch dependencies, endowing our model with powerful temporal modeling capability. To further integrate multi-scale temporal patterns, a multi-scale aggregator is incorporated into the network to capture scale-wise correlations and inter-variable relationships. Predictors provide direct multi-step trajectory forecasting and each predictor is a fully connected network. All the predictor results are aggregated to generate the final prediction.

3.1 Global Temporal Embedding

Differential Coding

In the WGS84 coordinate system, longitude and latitude are limited to the intervals $[-180^{\circ},180^{\circ}]$ and $[-90^{\circ},90^{\circ}]$ respectively, while altitude can span from 0 up to tens of thousands of meters. The significant differences in data range caused by physical units may impair the trajectory prediction performance. Generally, normalization algorithms are applied to address this issue. However, the normalized prediction errors must be transformed back into the raw data range to evaluate the actual performance. For example, if the absolute prediction error of longitude is $10^{-4}$ after Min-Max normalization, the actual prediction error is 0.036° (i.e., $10^{-4}\times 360^{\circ}$ when the full longitude range is normalized, approximately 4000 meters), which is unacceptable for the FTP task. Moreover, as shown in Figure 1, the original series of longitude and latitude are overly smooth and only reflect the overall flight trend over a period. If temporal patterns are learned from the original values of longitude and latitude, the model fails to explore the implicit semantic information and cannot focus on short-term temporal variations in flight trajectories.
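The arithmetic behind this example can be checked with a few lines of Python. This is only a sketch: it assumes that Min-Max normalization spans the full 360° longitude range and reuses the rough figure of 111 km per degree quoted in the introduction. Both constants are illustrative assumptions rather than values taken from the released code.

```python
# Assumed constants for illustration (not taken from the released code).
LON_RANGE_DEG = 360.0        # longitude span covered by Min-Max scaling
METERS_PER_DEGREE = 111_000  # rough length of one degree, as quoted in the text


def denormalized_error(normalized_error: float) -> tuple[float, float]:
    """Map an absolute error in normalized [0, 1] space back to degrees and meters."""
    error_deg = normalized_error * LON_RANGE_DEG
    error_m = error_deg * METERS_PER_DEGREE
    return error_deg, error_m


error_deg, error_m = denormalized_error(1e-4)
print(f"{error_deg:.3f} deg, about {error_m:.0f} m")
```

A normalized error that looks negligible therefore corresponds to a position error of several kilometers in the raw data range.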

To address the above issues, we use first-order differences for longitude and latitude and keep the original values of the other variables. The differential values are then converted into meters. This process can be formulated as:

\left\{\begin{aligned} \Delta_{\mathit{Lon}}&=(\mathit{Lon}_{t}-\mathit{Lon}_{t-1})\times\frac{\pi R}{180}\cos\theta\ (\mathrm{m})\\ \Delta_{\mathit{Lat}}&=(\mathit{Lat}_{t}-\mathit{Lat}_{t-1})\times\frac{\pi R}{180}\ (\mathrm{m})\\ \end{aligned}\right.

(1)

where $\theta$ is the latitude at time step $t$ and $R$ is the radius of the earth. By using differential coding for longitude and latitude, the differences in data range are effectively reduced. For example, in our dataset, the range of latitude in the original data is about $[-46^{\circ},70^{\circ}]$ and that in the differential data is about $[-3860m,3860m]$ , which is comparable to the data range of altitude. Besides, compared to the original sequences, the differential series explicitly reflect the underlying temporal variations, which is essential for short-term temporal modeling. Note that we use the original values of altitude as inputs rather than differential values. One important reason is that altitude is more susceptible to noise, so its differences would fail to reflect the actual temporal variations. As a result, the flight state at time step $t$ becomes $\mathbf{x}_{t}=(\Delta_{\mathit{Lon}},\Delta_{\mathit{Lat}},Alt_{t},Vx_{t},Vy_{t},Vz_{t})^{\top}$ .
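A minimal Python sketch of Eq. (1) may make the conversion concrete. The Earth radius value, the function name `differential_coding`, and the three-point track are assumptions introduced for illustration; the text only names $R$ without giving its value.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius; the text only names R


def differential_coding(lon_deg, lat_deg):
    """Return per-step (delta_lon_m, delta_lat_m) pairs following Eq. (1).

    theta is the latitude at the current step t, as stated in the text.
    """
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    deltas = []
    for t in range(1, len(lon_deg)):
        theta = math.radians(lat_deg[t])
        d_lon = (lon_deg[t] - lon_deg[t - 1]) * meters_per_degree * math.cos(theta)
        d_lat = (lat_deg[t] - lat_deg[t - 1]) * meters_per_degree
        deltas.append((d_lon, d_lat))
    return deltas


# Hypothetical three-point track in degrees.
lon = [116.000, 116.002, 116.004]
lat = [40.000, 40.001, 40.002]
deltas = differential_coding(lon, lat)
for d_lon, d_lat in deltas:
    print(f"dLon = {d_lon:8.2f} m, dLat = {d_lat:8.2f} m")
```

Note that a series of length $L$ yields only $L-1$ differences. How the first time step is handled is not specified in the text, so the sketch simply omits it.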

Global Temporal Attention

Given the trajectory series $\mathbf{X}\in\mathbb{R}^{C\times L}$ , we first project flight state at each time step into $d$ dimension to generate temporal embeddings $\mathbf{T}^{0}\in\mathbb{R}^{L\times d}$ . Then, we apply multi-head self-attention (MSA) Vaswani2017AttentionIA on the dimension $L$ to capture the dependencies across all time steps. After attention, the embedding at each time step is enriched with temporal information from other time steps. This process is formulated as:

\begin{aligned}
\mathbf{T}^{0}&=\mathit{TimeEmbedding}(\mathbf{X}^{\top})\\
\mathbf{T}^{i}&=\mathit{LayerNorm}(\mathbf{T}^{i-1}+\mathit{MSA}(\mathbf{T}^{i-1})),\quad i=1,\dots,l\\
\mathbf{T}^{i}&=\mathit{LayerNorm}(\mathbf{T}^{i}+FC(\mathbf{T}^{i})),\quad i=1,\dots,l\\
\mathbf{Z}&=(\mathit{Linear}(\mathbf{T}^{l}))^{\top}
\end{aligned}

(2)

where $l$ is the number of attention layers, $LayerNorm$ denotes the layer normalization ba2016layer which has been widely adopted to address non-stationary issues, $MSA$ is the multi-head self-attention layer, $FC$ denotes a fully-connected layer and $Linear$ projects the embedding of each time step to dimension $C$ , i.e., $\mathbb{R}^{d}\rightarrow\mathbb{R}^{C}$ .
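To illustrate how attention over the time dimension $L$ enriches each temporal token, the following sketch implements one heavily simplified layer in plain Python. It is not the model's implementation: it assumes a single head, identity query/key/value projections, and no $FC$ sub-layer, and the token values are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(vec, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no learned scale)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def temporal_attention_layer(tokens):
    """One simplified attention step over time.

    tokens: list of L embeddings, each a list of d floats.
    Assumptions: single head, identity Q/K/V projections, no FC sub-layer.
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    out = []
    for t, w in enumerate(weights):
        mixed = [sum(w[s] * tokens[s][j] for s in range(len(tokens))) for j in range(d)]
        # Residual connection followed by layer normalization.
        out.append(layer_norm([a + b for a, b in zip(tokens[t], mixed)]))
    return out, weights


# Hypothetical embeddings with L = 3 time steps and d = 4.
tokens = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.5, 0.2], [0.3, 0.3, -0.2, 0.1]]
enriched, weights = temporal_attention_layer(tokens)
print(len(enriched), len(enriched[0]))
```

Each row of `weights` sums to one, so every output token is a convex combination of all time steps added to its own embedding.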

3.2 Multi-Scale Patch Network

Considering different temporal patterns prefer diverse time scales, the multi-scale patch network first utilizes a stack of $K$ patch mixer blocks to capture underlying temporal patterns from large scale to small scale. A large time scale can reflect the slow-varying flight trends, while a smaller scale can retain fine-grained local details. To further promote the collaboration of diverse temporal features, a multi-scale aggregator is introduced to consider the contributed scales and dominant variables. Such a multi-scale network equips our model with the powerful and complete temporal modeling capability, and helps preserve all kinds of multi-scale characteristics.

3.2.1 Patch Mixer Block

Patching

Considering only a single time step is insufficient for the FTP task, since it contains limited semantic information and cannot accurately reflect the flight trajectory variations. Inspired by 2021An ; Yuqietal-2023-PatchTST , the trajectory representation $\mathbf{Z}\in\mathbb{R}^{C\times L}$ is segmented into several non-overlapping patches along the temporal dimension, generating a sequence of patches $\mathbf{Z}_{p}\in\mathbb{R}^{C\times P\times N}$ , where $P$ is the length of each patch, $N$ represents the number of patches, and $N=\left\lceil\frac{L}{P}\right\rceil$ . The patching process is formulated as:

{\mathbf{Z}}_{p}={Reshape}({ZeroPadding}(\mathbf{Z}))

(3)

where $ZeroPadding(\cdot)$ pads the series with zeros at the beginning so that its length is divisible by $P$ .
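The patching step in Eq. (3) can be sketched for a single variable as follows. The function name `make_patches` and the toy series are assumptions for illustration. The sketch returns a list of $N$ patches of length $P$, whereas the text arranges the same values as $C\times P\times N$.

```python
import math


def make_patches(series, patch_len):
    """Split a 1-D series into non-overlapping patches of length patch_len.

    Zeros are prepended so that the padded length is divisible by patch_len,
    matching the ZeroPadding description in the text.
    """
    num_patches = math.ceil(len(series) / patch_len)
    pad = num_patches * patch_len - len(series)
    padded = [0.0] * pad + list(series)
    return [padded[i * patch_len:(i + 1) * patch_len] for i in range(num_patches)]


series = [float(v) for v in range(1, 11)]  # hypothetical series with L = 10
patches = make_patches(series, patch_len=4)
print(patches)

# Number of patches N = ceil(L / P) for L = 60 and the patch sizes in the experiments.
patch_counts = {p: math.ceil(60 / p) for p in (30, 20, 10, 6, 2)}
print(patch_counts)
```

With $L=60$, every patch size used in the experiments divides the window evenly, so no padding is needed in that setting.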

Patch Encoder-Decoder

Based on the divided patches $\mathbf{Z}_{p}$ , we utilize a patch encoder and decoder to capture temporal features in flight trajectories. Specifically, the patch encoder aims to capture the inter-patch features (i.e., the global correlations across patches) and intra-patch features (i.e., the local details within patches). After that, these features are reconstructed to the original dimension by the patch decoder. Due to the superiority of linear models for MTS chen2023tsmixer ; zeng2023transformers , the patch encoders and decoders are based on pure multi-layer perceptron (MLP) for temporal modeling.

As illustrated in Figure 3, a patch encoder consists of an inter-patch MLP, an intra-patch MLP, and a linear projection. Each MLP has two fully connected layers, a GELU non-linearity layer and a dropout layer with a residual connection.

Given the patch-divided series $\mathbf{Z}_{p}$ , an inter-patch MLP performs on the dimension $N$ to capture the dependencies between different patches, which maps $\mathbb{R}^{N}\rightarrow\mathbb{R}^{N}$ to obtain the inter-patch mixed representation $\mathbf{N}_{inter}\in\mathbb{R}^{C\times P\times N}$ :

\mathbf{N}_{inter}=\mathbf{Z}_{p}+Dropout(FC(\sigma(FC(\mathbf{Z}_{p}))))

(4)

where $\sigma$ denotes a GELU non-linearity layer, $Dropout$ denotes a dropout layer and $\mathbf{N}_{inter}$ reflects the global correlations across patches. After that, an intra-patch MLP performs on the dimension $P$ to capture the dependencies across different time steps within patches, which maps $\mathbb{R}^{P}\rightarrow\mathbb{R}^{P}$ to obtain the intra-patch mixed representation $\mathbf{N}_{intra}\in\mathbb{R}^{C\times N\times P}$ :

\mathbf{N}_{intra}=\mathbf{N}_{inter}^{\top}+Dropout(FC(\sigma(FC(\mathbf{N}_{inter}^{\top}))))

(5)

where $\mathbf{N}_{intra}$ reflects the local details between different time steps within patches. Then, we perform a linear projection on $\mathbf{N}_{intra}^{\top}$ to obtain the final inter- and intra-patch mixed representation $\mathbf{E}\in\mathbb{R}^{C\times P\times 1}$ :

\mathbf{E}=\mathit{Linear}(\mathbf{N}_{intra}^{\top})

(6)

After such patch encoding process, the correlations between nearby time steps within patches and distant time steps across patches are finely explored. Then, we utilize a patch decoder to reconstruct the original sequence. A patch decoder comprises the same components as the encoder in a reverse order, which is formulated as follows:

\begin{aligned}
\mathbf{D}&=\mathit{Linear}(\mathbf{E})\\
\mathbf{P}_{intra}&=\mathbf{D}^{\top}+Dropout(FC(\sigma(FC(\mathbf{D}^{\top}))))\\
\mathbf{P}&=\mathbf{P}_{intra}^{\top}+Dropout(FC(\sigma(FC(\mathbf{P}_{intra}^{\top}))))
\end{aligned}

(7)

where $Linear$ makes a dimensional projection to obtain $\mathbf{D}\in\mathbb{R}^{C\times P\times N}$ for reconstructing the original sequence, $\mathbf{P}_{intra}\in\mathbb{R}^{C\times N\times P}$ is the reconstructed intra-patch mixed representation, and $\mathbf{P}\in\mathbb{R}^{C\times P\times N}$ is the final reconstructed intra- and inter-patch mixed representation.
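The tensor shapes through one patch mixer block can be summarized as below. The shapes are those stated in Eqs. (4) to (7). The $N\rightarrow 1$ and $1\rightarrow N$ mappings of the two linear projections are inferred from those shapes rather than stated explicitly, and the numeric instance assumes $C=6$, $L=60$, and $P=10$ (hence $N=6$).

```latex
\begin{aligned}
\mathbf{Z}_{p} &\in \mathbb{R}^{C\times P\times N} && (6\times 10\times 6) && \text{patched input}\\
\mathbf{N}_{inter} &\in \mathbb{R}^{C\times P\times N} && (6\times 10\times 6) && \text{MLP along } N\\
\mathbf{N}_{intra} &\in \mathbb{R}^{C\times N\times P} && (6\times 6\times 10) && \text{MLP along } P\\
\mathbf{E} &\in \mathbb{R}^{C\times P\times 1} && (6\times 10\times 1) && \text{linear projection } N\rightarrow 1\\
\mathbf{D} &\in \mathbb{R}^{C\times P\times N} && (6\times 10\times 6) && \text{linear projection } 1\rightarrow N\\
\mathbf{P}_{intra} &\in \mathbb{R}^{C\times N\times P} && (6\times 6\times 10) && \text{MLP along } P\\
\mathbf{P} &\in \mathbb{R}^{C\times P\times N} && (6\times 10\times 6) && \text{MLP along } N
\end{aligned}
```

The encoder thus compresses the patch dimension to a single summary per within-patch position, and the decoder mirrors the encoder to restore the original layout.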

3.2.2 Multi-Scale Aggregator

To enable more complete multi-scale modeling, we introduce a multi-scale aggregator to integrate different temporal patterns. It contains two components: scale fusion and channel fusion. Scale fusion identifies critical time scales and captures the scale-wise correlations, while channel fusion discovers the dominant variables affecting temporal variations and explores the inter-variable relationships. These two components work together to help the model learn a robust multi-scale representation and improve generalization ability.

Given the $K$ scale-specific temporal representations $\{\mathbf{P}_{1},\mathbf{P}_{2},\dots,\mathbf{P}_{K}\}$ , we first stack them and rearrange the data to combine the three dimensions of channel size $C$ , patch size $P$ and patch quantity $N$ , resulting in $\mathbf{S}^{0}\in\mathbb{R}^{K\times(C\times L)}$ , where $L=P\times N$ . Then we apply MSA on the scale dimension $K$ to learn the importance of contributed time scales. This process is formulated as:

\begin{aligned}
\mathbf{S}^{0}&=Reshape(Stack(\mathbf{P}_{1},\mathbf{P}_{2},\dots,\mathbf{P}_{K}))\\
\mathbf{S}^{i}&=LayerNorm(\mathbf{S}^{i-1}+MSA(\mathbf{S}^{i-1})),\quad i=1,\dots,l\\
\mathbf{S}^{i}&=LayerNorm(\mathbf{S}^{i}+FC(\mathbf{S}^{i})),\quad i=1,\dots,l
\end{aligned}

(8)

where $\mathbf{S}^{l}$ is the final multi-scale fusion representation within variables. Inspired by liu2023itransformer , we consider each variable as a token and apply MSA to explore dependencies between different variables. We first reshape $\mathbf{S}^{l}$ to get $\mathbf{C}^{0}\in\mathbb{R}^{C\times(K\times L)}$ and perform multi-head self-attention on the channel dimension $C$ to identify dominant variables. This process is formulated as follows:

\begin{aligned}
\mathbf{C}^{0}&=Reshape(\mathbf{S}^{l})\\
\mathbf{C}^{i}&=LayerNorm(\mathbf{C}^{i-1}+MSA(\mathbf{C}^{i-1})),\quad i=1,\dots,l\\
\mathbf{C}^{i}&=LayerNorm(\mathbf{C}^{i}+FC(\mathbf{C}^{i})),\quad i=1,\dots,l\\
\mathbf{H}&=Reshape(\mathbf{C}^{l})
\end{aligned}

(9)

where $\mathbf{H}\in\mathbb{R}^{C\times L\times K}$ is the final multi-scale representation which involves cross-scale correlations and inter-variable relationships.
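The two reshapes that turn scales and then variables into attention tokens can be sketched in plain Python. Only the data rearrangement is shown and the attention layers themselves are omitted. The element ordering inside each token and the helper names `scale_tokens` and `channel_tokens` are assumptions, since the text specifies only the resulting shapes.

```python
def scale_tokens(reps):
    """K representations of shape C x L -> K tokens of length C * L (scale fusion input)."""
    return [[v for channel in rep for v in channel] for rep in reps]


def channel_tokens(tokens, num_channels, seq_len):
    """K tokens of length C * L -> C tokens of length K * L (channel fusion input)."""
    return [
        [tok[c * seq_len + t] for tok in tokens for t in range(seq_len)]
        for c in range(num_channels)
    ]


# Toy example with K = 2 scales, C = 3 variables, L = 4 time steps.
# Each value encodes its origin as 100 * scale + 10 * channel + time.
K, C, L = 2, 3, 4
reps = [[[100 * k + 10 * c + t for t in range(L)] for c in range(C)] for k in range(K)]

s0 = scale_tokens(reps)          # shape K x (C * L)
c0 = channel_tokens(s0, C, L)    # shape C x (K * L)
print(len(s0), len(s0[0]))
print(len(c0), len(c0[0]))
print(c0[1])
```

In the first layout, attention compares whole scales against each other. In the second, each variable carries its values from all scales, so attention compares variables.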

3.3 Direct Multi-Step Prediction

We ensemble $K$ predictors to directly obtain the future flight trajectory series, which can exploit complementary information from different temporal patterns. The objective of our model is to predict the differential values of longitude and latitude relative to the last observation, and the raw absolute values of altitude, i.e., $\hat{\mathbf{Y}}=\left\{\hat{\mathbf{x}}_{L+1},...,\hat{\mathbf{x}}_{L+T}\right\}$ , where $\hat{\mathbf{x}}_{L+i}=(\hat{\Delta}^{\mathit{Lon}}(L+i,L),\hat{\Delta}^{\mathit{Lat}}(L+i,L),\hat{Alt}_{L+i})^{\top}$ for $i=1,\dots,T$ . We split the final multi-scale representation $\mathbf{H}\in\mathbb{R}^{C\times L\times K}$ into a sequence $\left\{\mathbf{H}_{*,1},\mathbf{H}_{*,2},\dots,\mathbf{H}_{*,K}\right\}$ , where $\mathbf{H}_{*,i}\in\mathbb{R}^{C\times L}$ for $i=1,\dots,K$ , and feed each $\mathbf{H}_{*,i}$ to a predictor. Each predictor has two MLPs. The first, $MLP_{C_{i}}$ , transforms the input channel dimension $C$ into the output channel dimension $C^{\prime}$ , and the second, $MLP_{T_{i}}$ , projects the historical sequence length $L$ to the prediction horizon $T$ .

\begin{aligned}
\hat{\mathbf{Y}_{i}}&=MLP_{T_{i}}(MLP_{C_{i}}(\mathbf{H}_{*,i}))\\
\hat{\mathbf{Y}}&=\sum\limits_{i=1}^{K}\hat{\mathbf{Y}_{i}}
\end{aligned}

(10)

Finally, all the predictor results are aggregated to generate the final prediction, which can enhance the stability and generalization of our model.
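The data flow of Eq. (10) can be sketched with plain matrix products. As a simplification, each MLP is replaced here by a single linear map without bias or non-linearity, and all weights and inputs are made-up values. The sketch only illustrates how the channel projection, the temporal projection, and the final sum compose.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def predictor(h, w_channel, w_time):
    """Map h (C x L) to a prediction (C' x T).

    w_channel (C' x C) stands in for MLP_C and w_time (L x T) for MLP_T.
    """
    return matmul(matmul(w_channel, h), w_time)


def ensemble(outputs):
    """Element-wise sum of the K predictor outputs, as in Eq. (10)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*outputs)]


# Toy setting: K = 2 scales, C = 2, L = 3, C' = 1, T = 2 (all values hypothetical).
h_per_scale = [
    [[1, 2, 3], [4, 5, 6]],
    [[1, 0, 0], [0, 1, 0]],
]
weights = [
    ([[1, 1]], [[1, 0], [0, 1], [1, 1]]),
    ([[1, -1]], [[1, 0], [0, 1], [1, 1]]),
]

outputs = [predictor(h, w_c, w_t) for h, (w_c, w_t) in zip(h_per_scale, weights)]
y_hat = ensemble(outputs)
print(outputs)
print(y_hat)
```

Each scale has its own pair of weights, mirroring the scale-specific predictors, and the final prediction is their unweighted sum.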

4 Experiments

4.1 Dataset and Experimental Setup

Datasets

To evaluate the performance of FlightPatchNet, we conduct extensive experiments on ADS-B data provided by OpenSky (https://opensky-network.org/datasets/states/) from 2020 to 2022. In this paper, six key attributes are extracted from the original data: longitude, latitude, altitude, and the velocities in the x, y, and z dimensions. The dataset is divided into three parts for training, validation, and testing with a ratio of 8:1:1.

Baselines and Setup

We compare our model with five competitive models, including four IMS-based models, LSTM LSTM8489734 , Bi-LSTM Sahadevan , CNN-LSTM CNN-LSTM9145522 , and FlightBERT Guo2023FlightBERT , and one DMS-based model, FlightBERT++ Guo2023FlightBERT++ . These models cover mainstream deep learning architectures, including Transformer (FlightBERT, FlightBERT++), CNN (CNN-LSTM), and RNN (LSTM, Bi-LSTM, CNN-LSTM), which helps to provide a comprehensive comparison. For fairness, all models follow the same experimental setup with look-back window $L=60$ and prediction horizon $T\in\{1,3,9,15\}$ . Our model is trained with MSE loss, using the Adam optimizer kingma2014adam . We adopt the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) as evaluation metrics. More details about the dataset, baselines, implementation, and hyper-parameters are given in Appendix A.
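For reference, the two evaluation metrics can be written in a few lines of Python. The altitude values below are hypothetical and serve only to show how a single large error affects RMSE more strongly than MAE.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Hypothetical altitude values in meters; the last prediction is off by 10 m.
alt_true = [10000.0, 10010.0, 10020.0, 10030.0]
alt_pred = [10002.0, 10008.0, 10020.0, 10040.0]

alt_mae = mae(alt_true, alt_pred)
alt_rmse = rmse(alt_true, alt_pred)
print(f"MAE = {alt_mae:.2f} m, RMSE = {alt_rmse:.2f} m")
```

RMSE is never smaller than MAE on the same errors, and the gap widens when a few predictions are far off, which is why a model can have a low MAE and still a large RMSE.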

4.2 Main results

Comprehensive flight trajectory prediction results are reported in Table 1 (see Appendix B.2 for error bars). FlightPatchNet achieves the best performance across all prediction horizons for longitude and latitude in terms of both MAE and RMSE, but does not achieve the best MAE for altitude compared with strong baselines such as FlightBERT++. For brevity, we consider prediction horizon $T=15$ and compare our model with the second best. FlightPatchNet achieves an 18.62% reduction in MAE and a 49.63% reduction in RMSE for longitude, and a 35.31% reduction in MAE and a 44.80% reduction in RMSE for latitude. For altitude, FlightBERT++ achieves an MAE 45.51 meters lower than our model but has a large RMSE, which may be caused by high-bit errors in the prediction. FlightPatchNet obtains the smallest RMSE for all variables, indicating that our model provides more robust and stable predictions. Furthermore, as the prediction horizon increases, DMS-based models greatly outperform IMS-based models, which suffer from serious performance degradation due to error accumulation. Note that longitude, latitude, and altitude together determine the position of an aircraft, so poor prediction performance on any one variable is intolerable for the short-term FTP task. Thus, FlightPatchNet achieves the most competitive performance overall.
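The relative reductions quoted above follow directly from the $T=15$ columns of Table 1, taking FlightBERT++ as the second-best model. The short check below reproduces them. The helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


# (FlightBERT++, FlightPatchNet) at T = 15, copied from Table 1.
t15 = {
    "Lon MAE": (0.01187, 0.00966),
    "Lon RMSE": (0.03131, 0.01577),
    "Lat MAE": (0.01048, 0.00678),
    "Lat RMSE": (0.02127, 0.01174),
}

reductions = {name: round(relative_reduction(b, o), 2) for name, (b, o) in t15.items()}
print(reductions)

# Altitude MAE gap in meters (FlightPatchNet minus FlightBERT++).
alt_mae_gap = round(123.97 - 78.46, 2)
print(alt_mae_gap)
```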

Table 1: Flight trajectory prediction results. A lower MAE or RMSE represents a better prediction. The prediction horizon is $T\in\left\{1,3,9,15\right\}$ and the look-back window size is $L=60$ for all experiments. The best results are highlighted in bold and the second best are underlined. Note that $0.00001^{\circ}$ is about 1m.

			Lon(°)				Lat(°)				Alt(m)
	Model	Metric	T=1	T=3	T=9	T=15	T=1	T=3	T=9	T=15	T=1	T=3	T=9	T=15
IMS	LSTM	MAE	0.00056	0.00427	0.01747	0.03132	0.00049	0.00493	0.02116	0.03717	92.27	159.30	549.55	882.86
	LSTM	RMSE	0.00095	0.00691	0.02597	0.04578	0.00089	0.00740	0.02956	0.05143	142.05	233.39	763.84	1205.78
	Bi-LSTM	MAE	0.00155	0.00747	0.02319	0.03890	0.00137	0.00824	0.02711	0.04404	432.50	761.50	1648.68	2006.21
	Bi-LSTM	RMSE	0.00202	0.01124	0.03387	0.05532	0.00181	0.01142	0.03639	0.05982	563.74	953.37	2132.91	2420.74
	CNN-LSTM	MAE	0.00139	0.00700	0.02282	0.04149	0.00131	0.00801	0.02623	0.05139	520.03	746.67	1569.68	1136.80
	CNN-LSTM	RMSE	0.00240	0.01033	0.03263	0.05981	0.00212	0.01130	0.03559	0.07353	1176.96	926.40	1936.63	1658.53
	FlightBERT	MAE	0.00123	0.00241	0.01162	0.02407	0.00088	0.00158	0.00963	0.01238	24.67	35.67	78.58	134.29
	FlightBERT	RMSE	0.00241	0.00526	0.02189	0.03969	0.00154	0.00286	0.01904	0.03093	234.17	272.59	384.22	462.28
DMS	FlightBERT++	MAE	0.00173	0.00317	0.00871	0.01187	0.00085	0.00210	0.00612	0.01048	9.39	21.89	47.84	78.46
	FlightBERT++	RMSE	0.00360	0.00659	0.01846	0.03131	0.00148	0.00425	0.00959	0.02127	175.29	167.16	327.93	384.18
	FlightPatchNet (Ours)	MAE	0.00048	0.00153	0.00546	0.00966	0.00032	0.00105	0.00381	0.00678	13.34	32.65	78.57	123.97
	FlightPatchNet (Ours)	RMSE	0.00087	0.00233	0.00885	0.01577	0.00064	0.00175	0.00652	0.01174	123.78	121.48	174.63	244.34

4.3 Effectiveness of Differential Coding

The results in Table 2 show that using differential coding for longitude and latitude significantly improves their prediction performance but can decrease the accuracy of altitude, as is the case for FlightPatchNet. Differential coding reveals the temporal variations of longitude and latitude, which helps the temporal modeling of flight trajectories. However, the variations of altitude in the original series may come from unexpected noise. FlightPatchNet has strong modeling capacity for temporal variations and tends to focus on the noise points during altitude prediction, leading to a large deviation from the ground truth. Further analysis is presented in Appendix C.

Table 2: Flight trajectory prediction results for longitude and latitude in original data and differential data when prediction horizon $T=15$ . The best results are highlighted in bold. Note that altitude and velocities are always in original data.

		Lon(°)		Lat(°)		Alt(m)
Models	Diff	MAE	RMSE	MAE	RMSE	MAE	RMSE
LSTM	$\checkmark$	0.03132	0.04578	0.03717	0.05143	882.86	1205.78
LSTM	$\times$	0.82230	1.20424	0.12008	2.44136	768.45	1053.21
Bi-LSTM	$\checkmark$	0.03890	0.05532	0.04404	0.05982	2006.21	2420.74
Bi-LSTM	$\times$	1.71433	2.43607	0.19014	0.27621	2091.19	2665.51
CNN-LSTM	$\checkmark$	0.04149	0.05981	0.05139	0.07353	1136.80	1658.53
CNN-LSTM	$\times$	8.59512	23.07600	1.95957	8.15418	1638.02	2113.49
FlightPatchNet (ours)	$\checkmark$	0.00966	0.01577	0.00678	0.01174	123.97	244.34
FlightPatchNet (ours)	$\times$	0.19348	0.26243	0.05385	0.07457	60.63	169.57

4.4 Effectiveness of Multi Scales

Table 3: The flight trajectory prediction results of single scales and multi scales. The best results are highlighted in bold.

		Lon (°)				Lat (°)				Alt (m)
Patch Size	Metric	1	3	9	15	1	3	9	15	1	3	9	15
2	MAE	0.00050	0.00156	0.00564	0.01038	0.00034	0.00105	0.00388	0.00711	16.50	23.39	93.28	106.63
2	RMSE	0.00091	0.00248	0.00907	0.01675	0.00065	0.00176	0.00654	0.01218	122.41	105.19	202.30	225.08
6	MAE	0.00052	0.00158	0.00583	0.01027	0.00036	0.00106	0.00402	0.00715	12.72	22.57	120.51	127.15
6	RMSE	0.00091	0.00251	0.00937	0.01656	0.00067	0.00178	0.00670	0.01227	106.48	106.58	226.89	247.28
10	MAE	0.00049	0.00162	0.00570	0.01019	0.00033	0.00108	0.00382	0.00700	12.16	35.77	107.07	178.91
10	RMSE	0.00088	0.00256	0.00912	0.01652	0.00064	0.00180	0.00654	0.01201	126.31	101.96	216.02	306.40
20	MAE	0.00054	0.00163	0.00562	0.01032	0.00036	0.00110	0.00391	0.00707	15.58	25.57	95.56	152.48
20	RMSE	0.00091	0.00257	0.00907	0.01660	0.00066	0.00182	0.00659	0.01211	99.66	106.19	202.99	286.20
30	MAE	0.00050	0.00161	0.00580	0.01065	0.00033	0.00108	0.00399	0.00716	10.35	24.75	75.23	144.55
30	RMSE	0.00090	0.00256	0.00935	0.01719	0.00065	0.00180	0.00676	0.01220	104.48	99.18	182.47	273.72
30,20,10,6,2	MAE	0.00048	0.00153	0.00546	0.00966	0.00032	0.00105	0.00381	0.00678	13.34	32.65	78.57	123.97
30,20,10,6,2	RMSE	0.00087	0.00233	0.00885	0.01577	0.00064	0.00175	0.00652	0.01174	129.65	121.78	174.63	244.34

To investigate the effect of multi-scale modeling, we conduct single-scale experiments with patch sizes {2, 6, 10, 20, 30}. The results in Table 3 illustrate the critical contribution of multiple scales to our model. We can observe that different variables prefer distinct time scales. For example, when the prediction horizon is $T=15$, patch size 10 obtains the second-best prediction performance on longitude and latitude but the worst performance on altitude. This indicates that longitude, latitude and altitude have distinct temporal patterns, and that different scales extract diverse complementary features, which can be effectively leveraged to obtain competitive and robust prediction results.
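To make the notion of a time scale concrete, the following minimal sketch divides one look-back window into patches for each patch size used in the paper. It assumes non-overlapping patches and a window of 60 steps (the value of $L$ given in Appendix A.3); the helper `split_into_patches` is a hypothetical name, and the actual patching code in FlightPatchNet may differ.

```python
def split_into_patches(series, patch_size):
    """Split a sequence into non-overlapping patches of equal length."""
    if len(series) % patch_size != 0:
        raise ValueError("patch_size must divide the sequence length")
    return [series[i:i + patch_size] for i in range(0, len(series), patch_size)]


L = 60                            # look-back window length
window = list(range(L))           # placeholder values for a single variable
patch_sizes = [30, 20, 10, 6, 2]  # coarse to fine

# Number of patches produced at each scale.
num_patches = {p: len(split_into_patches(window, p)) for p in patch_sizes}
```

A patch size of 30 yields two long patches that expose global trends, whereas a patch size of 2 yields thirty short patches that expose local details.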

4.5 Ablation Study

We conduct ablation studies by removing the corresponding modules from FlightPatchNet. Specifically, w/o global temporal attention does not capture the correlations between time steps, w/o scale fusion treats every time scale as equally important, and w/o channel fusion does not explore the relationships between variables. Table 4 shows the contribution of each component.

Table 4: Performance comparisons on ablative variants. The best results are highlighted in bold.

Case	Metric	Lon (°)				Lat (°)				Alt (m)
Case	Metric	1	3	9	15	1	3	9	15	1	3	9	15
w/o global temporal attention	MAE	0.00051	0.00190	0.00667	0.01232	0.00034	0.00132	0.00466	0.00876	20.59	28.15	66.62	112.43
w/o global temporal attention	RMSE	0.00090	0.00308	0.01085	0.02005	0.00066	0.00222	0.00791	0.01486	136.79	112.10	163.78	220.83
w/o scale fusion	MAE	0.00053	0.00169	0.00609	0.01112	0.00035	0.00114	0.00409	0.00759	24.17	32.70	91.06	162.20
w/o scale fusion	RMSE	0.00092	0.00268	0.00975	0.01787	0.00067	0.00188	0.00688	0.01280	130.05	99.54	194.25	282.07
w/o channel fusion	MAE	0.00050	0.00166	0.00573	0.01059	0.00034	0.00112	0.00398	0.00727	20.04	29.22	73.23	131.98
w/o channel fusion	RMSE	0.00089	0.00265	0.00924	0.01707	0.00065	0.00187	0.00667	0.01240	159.47	122.32	174.36	250.07
FlightPatchNet	MAE	0.00048	0.00153	0.00546	0.00966	0.00032	0.00105	0.00381	0.00678	13.34	32.65	78.57	123.97
FlightPatchNet	RMSE	0.00087	0.00233	0.00885	0.01577	0.00064	0.00175	0.00652	0.01174	123.78	121.48	174.63	244.34

Removing the global temporal attention dramatically decreases the multi-step prediction performance, demonstrating the necessity of capturing the correlations between different time steps. Scale fusion effectively improves the prediction accuracy, indicating that different time scales of trajectory series contain rich and diverse temporal variation information. Channel fusion also improves the model performance, suggesting the importance of exploring relationships between different variables in complex temporal modeling.

5 Conclusion

In this paper, we propose FlightPatchNet, a multi-scale patch network with differential coding for the short-term FTP task. Differential coding is leveraged to reduce the significant differences in the original data range and to reflect the temporal variations in realistic flight trajectories. The multi-scale patch network is designed to explore global trends and local details based on patches of different sizes, and to integrate scale-wise correlations and inter-variable relationships for complete temporal modeling. Extensive experiments on a real-world dataset demonstrate that FlightPatchNet achieves the most competitive performance.

References

[1] Afshin Abadi, Tooraj Rajabioun, and Petros A. Ioannou. Traffic flow prediction for road transportation networks with limited traffic data. IEEE Transactions on Intelligent Transportation Systems, 16(2):653–662, 2015.
[2] Yi Lin, Jianwei Zhang, and Hong Liu. Deep learning based short-term air traffic flow prediction considering temporal–spatial correlation. Aerospace Science and Technology, 93:105–113, 2019.
[3] Zhengmao Chen, Dongyue Guo, and Yi Lin. A deep gaussian process-based flight trajectory prediction approach and its application on conflict detection. Algorithms, 13(11):293, 2020.
[4] Zhengyi Wang, Man Liang, and Daniel Delahaye. A hybrid machine learning model for short-term estimated time of arrival prediction in terminal manoeuvring area. Transportation Research Part C: Emerging Technologies, 95:280–294, 2018.
[5] Yi Lin, Linjie Deng, Zhengmao Chen, Xiping Wu, Jianwei Zhang, and Bo Yang. A real-time atc safety monitoring framework using a deep learning approach. IEEE Transactions on Intelligent Transportation Systems, 21(11):4572–4581, 2020.
[6] Zhiyuan Shi, Min Xu, and Quan Pan. 4-d flight trajectory prediction with constrained lstm network. IEEE Transactions on Intelligent Transportation Systems, 22(11):7242–7255, 2021.
[7] Donggi Jeong, Minjin Baek, and Sang-Sun Lee. Long-term prediction of vehicle trajectory based on a deep neural network. In 2017 International Conference on Information and Communication Technology Convergence (ICTC), pages 725–727. IEEE, 2017.
[8] Du Runle, Liu Jiaqi, Gao Lu, Li Zhifeng, and Zhang Li. Long term trajectory prediction based on advanced guidance law recognition. In 2017 IEEE International Workshop on Metrology for AeroSpace (MetroAeroSpace), pages 456–461. IEEE, 2017.
[9] Chengjue Yuan, Dewei Li, and Dewei Xi. Medium-term prediction of urban traffic states using probability tree. In 2016 35th Chinese Control Conference (CCC), pages 9246–9251. IEEE, 2016.
[10] Dan Chen, Minghua Hu, Ke Han, Honghai Zhang, and Jianan Yin. Short/medium-term prediction for the aviation emissions in the en route airspace considering the fluctuation in air traffic demand. Transportation Research Part D: Transport and Environment, 48:46–62, 2016.
[11] Darong Huang, Zhenping Deng, Ling Zhao, and Bo Mi. A short-term traffic flow forecasting method based on markov chain and grey verhulst model. In 2017 6th Data Driven Control and Learning Systems (DDCLS), pages 606–610. IEEE, 2017.
[12] Peibo Duan, Guoqiang Mao, Weifa Liang, and Degan Zhang. A unified spatio-temporal model for short-term traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems, 20(9):3212–3223, 2018.
[13] Honglei Yan, Genghua Huang, Haiwei Wang, and Rong Shu. Application of unscented kalman filter for flying target tracking. In 2013 International Conference on Information Science and Cloud Computing, pages 61–66, 2013.
[14] Zheng Zhang, Dongyue Guo, Shizhong Zhou, Jianwei Zhang, and Yi Lin. Flight trajectory prediction enabled by time-frequency wavelet transform. Nature Communications, 14(1):5258, 2023.
[15] Dongyue Guo, Edmond Q. Wu, Yuankai Wu, Jianwei Zhang, Rob Law, and Yi Lin. Flightbert: Binary encoding representation for flight trajectory prediction. IEEE Transactions on Intelligent Transportation Systems, 24(2):1828–1842, 2023.
[16] Dongyue Guo, Zheng Zhang, Zhen Yan, Jianwei Zhang, and Yi Lin. Flightbert++: A non-autoregressive multi-horizon flight trajectory prediction framework. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 127–134, 2024.
[17] Han Wu, Yan Liang, Bin Zhou, and Hao Sun. A bi-lstm and autoencoder based framework for multi-step flight trajectory prediction. In 2023 8th International Conference on Control and Robotics Engineering (ICCRE), pages 44–50. IEEE, 2023.
[18] Lan Ma and Shan Tian. A hybrid cnn-lstm model for aircraft 4d trajectory prediction. IEEE Access, 8:134668–134680, 2020.
[19] Zhiyuan Shi, Min Xu, Quan Pan, Bing Yan, and Haimin Zhang. Lstm-based flight trajectory prediction. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2018.
[20] David P Thipphavong, Charles A Schultz, Alan G Lee, and Steven H Chan. Adaptive algorithm to improve trajectory prediction accuracy of climbing aircraft. Journal of Guidance, Control, and Dynamics, 36(1):15–24, 2013.
[21] Manuel Soler, Alberto Olivares, and Ernesto Staffetti. Multiphase optimal control framework for commercial aircraft four-dimensional flight-planning problems. Journal of Aircraft, 52(1):274–286, 2015.
[22] Jose V Benavides, John Kaneshige, Shivanjli Sharma, Ramesh Panda, and Mieczyslaw Steglinski. Implementation of a trajectory prediction function for trajectory based operations. In AIAA Atmospheric Flight Mechanics Conference, page 2198, 2014.
[23] Xin-min Tang, Long Zhou, Zhi-yuan Shen, and Miao Tang. 4d trajectory prediction of aircraft taxiing based on fitting velocity profile. In Aeronautical Computing Technique, volume 45, pages 1–12. 2015.
[24] Chao Wang, Jiuxia Guo, and Zhipeng Shen. Prediction of 4d trajectory based on basic flight models. Journal of southwest jiaotong university, 44(2):295–300, 2009.
[25] Zhijing Zhou, Jinliang Chen, Beibei Shen, Zhigang Xiong, Hua Shen, and Fangyue Guo. A trajectory prediction method based on aircraft motion model and grey theory. In 2016 IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), pages 1523–1527, 2016.
[26] Taobo Wang. 4d flight trajectory prediction model based on improved kalman filter. Journal of Computer Applications, 34(6), 2014.
[27] Lin Xi, Zhang Jun, Zhu Yanbo, and Liu Wei. Simulation study of algorithms for aircraft trajectory prediction based on ads-b technology. In 2008 Asia Simulation Conference-7th International Conference on System Simulation and Scientific Computing, pages 322–327. IEEE, 2008.
[28] Inseok Hwang, Jesse Hwang, and Claire Tomlin. Flight-mode-based aircraft conflict detection using a residual-mean interacting multiple model algorithm. In AIAA guidance, navigation, and control conference and exhibit, page 5340, 2003.
[29] X Rong Li and Vesselin P Jilkov. Survey of maneuvering target tracking. part v. multiple-model methods. IEEE Transactions on aerospace and electronic systems, 41(4):1255–1321, 2005.
[30] Yutian Pang, Xinyu Zhao, Jueming Hu, Hao Yan, and Yongming Liu. Bayesian spatio-temporal graph transformer network (b-star) for multi-aircraft trajectory prediction. Knowledge-Based Systems, 249, 2022.
[31] Zhengfeng Xu, Weili Zeng, Xiao Chu, and Puwen Cao. Multi-aircraft trajectory collaborative prediction based on social long short-term memory network. Aerospace, 8(4), 2021.
[32] Deepudev Sahadevan, Palanisamy Ponnusamy, Varun P Gopi, and Manjunath K Nelli. Ground-based 4d trajectory prediction using bi-directional lstm networks. Applied Intelligence, 52(14):16417–16434, 2022.
[33] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017.
[34] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[35] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, and Sylvain Gelly. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[36] Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations, 2023.
[37] Si-An Chen, Chun-Liang Li, Sercan O Arik, Nathanael Christian Yoder, and Tomas Pfister. Tsmixer: An all-mlp architecture for time series forecasting. Transactions on Machine Learning Research, 2023.
[38] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023.
[39] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In The Twelfth International Conference on Learning Representations, 2023.
[40] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In The Second International Conference on Learning Representations, 2014.

Appendix A Experimental Details

A.1 Dataset Preprocessing and Description

This paper exploits real-world datasets provided by OpenSky from 2020 to 2022 to validate our proposed model. The data preprocessing steps are as follows:

(1) Data Extraction: We extract seven features from the raw data, including timestamp, longitude, latitude, altitude, horizontal flight speed, horizontal flight angle, and vertical speed. The timestamp is used to identify whether the trajectory points are continuous, and the other six features are further processed as inputs to the model.

(2) Data Filtering: Because the raw dataset contains many missing values and outliers, we select 100 consecutive points without missing values as a complete flight trajectory. Then, we adopt the z-score method to detect outliers. If a flight trajectory contains any outlier, we discard the whole trajectory. The z-score formula is as follows:

z=\frac{\overline{x}-\mu}{\sigma/\sqrt{n}} \qquad (1)

where $\overline{x}$ is the value of each feature point, $\mu$ is the mean of each feature, $\sigma$ is the standard deviation of each feature, and $n$ is the number of feature points.

(3) Format Transformation: We transform the horizontal velocity into $V_{x}$ and $V_{y}$ according to the angle, where $V_{x}$ is the velocity in the longitude dimension and $V_{y}$ is the velocity in the latitude dimension. In this way, the features become longitude, latitude, altitude, $V_{x}$ , $V_{y}$ and $V_{z}$ .

(4) Data Segmentation: The dataset is randomly divided into three parts with a ratio of 8:1:1 for training, validation, and testing.
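The filtering, transformation and segmentation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses the common per-point form $z=(x-\mu)/\sigma$ rather than Eq. (1), a threshold of 3 that the paper does not specify, a track angle measured clockwise from north, and hypothetical helper names.

```python
import math
import random
from statistics import mean, pstdev


def has_outlier(values, threshold=3.0):
    """Flag a feature series if any point has |z| above the threshold.

    Assumption: per-point z = (x - mu) / sigma with threshold 3.
    """
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return False  # a constant series has no outliers
    return any(abs((v - mu) / sigma) > threshold for v in values)


def decompose_velocity(speed, track_deg):
    """Split horizontal speed into (V_x, V_y).

    Assumption: track_deg is measured clockwise from north, so V_x is the
    east (longitude) component and V_y is the north (latitude) component.
    """
    rad = math.radians(track_deg)
    return speed * math.sin(rad), speed * math.cos(rad)


def split_8_1_1(items, seed=0):
    """Shuffle and split into training, validation and test sets (8:1:1)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Hypothetical usage: a trajectory is kept only if no feature has an outlier.
altitude = [10000.0] * 99 + [30000.0]
keep = not has_outlier(altitude)
vx, vy = decompose_velocity(100.0, 90.0)  # heading due east
train, val, test = split_8_1_1(range(100))
```

Discarding the whole trajectory when any feature is flagged matches the rule stated in step (2); the fixed seed only makes the random split reproducible.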

After the above preprocessing, 274,605 flight trajectories are included in our dataset. The ranges of longitude, latitude and altitude are $[-179.86396^{\circ},178.82147^{\circ}]$ , $[-46.42435^{\circ},70.32590^{\circ}]$ and $[0,21031.00]$ m, respectively. The interval between two adjacent flight trajectory points is 10 seconds.

A.2 Baseline Methods

We briefly describe the five selected competitive baselines as follows:

• LSTM [19]: Two LSTM layers (with 30 and 60 nodes, respectively) encode each trajectory point, and future trajectories are predicted through a fully connected layer.
• Bi-LSTM [32]: Two Bi-LSTM layers (with 200 and 50 nodes, respectively) encode each trajectory point, and future trajectories are predicted through a fully connected layer.
• CNN-LSTM [18]: Two one-dimensional CNN layers (with convolution kernel size $1\times 3$ ) and two LSTM layers (with 50 nodes) encode each trajectory point, and future trajectories are predicted through a fully connected layer.
• FlightBERT [15]: It utilizes a BE representation to convert the scalar attributes of the flight trajectory into binary vectors, treating the FTP task as a set of binary classification problems. It uses 18, 16, 11 and 11 bits to encode the real values (decimals) of longitude, latitude, altitude and velocities into the BE representation, respectively.
• FlightBERT++ [16]: It inherits the BE representation from FlightBERT and introduces a differential prediction paradigm, which predicts the differential values of the trajectory attributes instead of the absolute values.
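As a rough illustration of what a fixed-width binary encoding looks like, the sketch below converts an integer code into an 11-bit vector (the width listed above for altitude) and back. How FlightBERT quantizes real values into integer codes is not described in this section, so the code value and the helpers `to_bits` and `from_bits` are purely hypothetical.

```python
def to_bits(value, width):
    """Encode a non-negative integer as a fixed-width bit list (MSB first)."""
    if not 0 <= value < 2 ** width:
        raise ValueError("value does not fit in the given width")
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def from_bits(bits):
    """Decode an MSB-first bit list back into an integer."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out


ALT_BITS = 11         # bit width reported for altitude
altitude_code = 1050  # hypothetical quantized altitude level

bits = to_bits(altitude_code, ALT_BITS)
restored = from_bits(bits)
```

Under such a representation each bit position can serve as one binary classification target, which is how a regression task turns into several binary classification problems.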

A.3 Implementation Details

For fairness, all the models follow the same experimental setup with look-back window $L=60$ and prediction horizon $T\in\{1,3,9,15\}$ , which means the observation time is 10 minutes and the forecasting times are 10 seconds, 30 seconds, 1.5 minutes and 2.5 minutes, respectively. The patch sizes in the multi-scale patch mixer blocks are set to {30, 20, 10, 6, 2}. The dimension of the temporal embedding $d$ is 128. For all the MSA in this paper, the number of heads is 8 and the number of attention layers $l$ is 3. The learning rate is set to $10^{-4}$ for all experiments. Our method is trained with the MSE loss, using the Adam optimizer [40]. Training runs for at most 30 epochs and is terminated early if the validation loss does not decrease for three consecutive rounds. The model is implemented in PyTorch 2.2.1 and trained on a single NVIDIA RTX 4080 GPU with 16GB memory.
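The stopping rule can be sketched as below. This is a minimal illustration that assumes one validation pass per epoch and that "does not decrease" means no improvement over the best loss seen so far; the function name and the loss values are hypothetical, and a real training loop would compute each loss from the model rather than read it from a list.

```python
def train_with_early_stopping(val_losses, max_epochs=30, patience=3):
    """Return (epochs_run, best_loss) for a sequence of validation losses.

    Training stops after max_epochs, or earlier once the validation loss
    has failed to improve on the best value for `patience` epochs in a row.
    """
    best = float("inf")
    bad_epochs = 0
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best, bad_epochs = loss, 0  # improvement resets the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return epochs_run, best


# Hypothetical validation curve: improvement stalls after the third epoch.
history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5]
epochs_run, best_loss = train_with_early_stopping(history)
```

With this curve, training stops after the sixth epoch and never reaches the later value of 0.5, which shows the trade-off of a short patience.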

A.4 Evaluation Metrics

Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are exploited to evaluate the proposed model and baselines, which are defined as:

$\mathrm{MAE}=\frac{1}{T}\sum_{i=1}^{T}\|\mathbf{Y}_{i}-\hat{\mathbf{Y}}_{i}\|$
$\mathrm{RMSE}=\sqrt{\frac{1}{T}\sum_{i=1}^{T}(\mathbf{Y}_{i}-\hat{\mathbf{Y}}_{i})^{2}}$

where $\mathbf{Y}_{i}$ and $\hat{\mathbf{Y}}_{i}$ are the ground truth and the prediction result for the $i$-th future point, respectively.
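For a single variable, the two metrics reduce to the following minimal sketch. Treating $\mathbf{Y}_{i}$ as a scalar per variable is an assumption that matches how the tables report longitude, latitude and altitude separately; the sample values are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error over the T predicted points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error over the T predicted points."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical altitude values in metres for a horizon of T = 3.
y_true = [100.0, 110.0, 120.0]
y_pred = [101.0, 108.0, 123.0]

mae_value = mae(y_true, y_pred)    # errors 1, 2, 3 -> mean 2.0
rmse_value = rmse(y_true, y_pred)  # sqrt((1 + 4 + 9) / 3)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and grows faster when a few points have large errors, which is consistent with the altitude columns, where RMSE is several times larger than MAE.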

Appendix B Additional Experimental Results

B.1 Hyper-Parameter Sensitivity

Number of Scales

We perform experiments with different numbers of scales and report the MAE and RMSE results. As shown in Figure 4, when the number of scales increases from 2 to 5, the performance of FlightPatchNet consistently improves. This is because FlightPatchNet can capture diverse global and local temporal patterns under different scales. When the number of scales increases to 6, the performance starts to deteriorate. This indicates that a certain number of scales is sufficient for temporal modeling, and excessive scales may lead to overfitting.

Number of Attention Layers

We test the number of attention layers in $\{1,2,3,6\}$ for global temporal attention, scale fusion, and channel fusion. The results are shown in Figure 5(a), Figure 5(b) and Figure 5(c). When the number of attention layers increases from 1 to 3, the values of MAE and RMSE decrease, demonstrating that with more attention layers our model can better capture the dependencies between different time steps, the scale-wise correlations and the inter-variable relationships. When the number of attention layers increases to 6, the prediction accuracy does not improve further. Thus, we use three attention layers in these parts.

Look-Back Window Size $\mathbf{L}$

Figure 6 shows the MAE and RMSE results of our model with different look-back window sizes. We set the window size $L$ to $\{10,20,30,40,50,60,70,80\}$ . The overall performance of FlightPatchNet improves significantly as the window size increases from 10 to 60, indicating that FlightPatchNet can thoroughly capture the temporal dependencies in long flight trajectories. Moreover, the performance on altitude fluctuates as the window size increases, suggesting that the altitude series is non-stationary and easily affected by unexpected noise. Thus, we set $L$ to 60 to achieve the best overall performance.

Order of Scales

We conduct experiments on the order of patch sizes and report the MAE and RMSE results. As shown in Table 5, we can observe that patch sizes in descending order can effectively improve the prediction performance, indicating that the macro knowledge from coarser scales can guide the temporal modeling of finer scales.

Table 5: The results of flight trajectory prediction with scales in ascending and descending order. $\uparrow$ means scales in ascending order and $\downarrow$ means scales in descending order. The better results are highlighted in bold.

Patch Sizes	Style	Metric	Lon (°)				Lat (°)				Alt (m)
Patch Sizes	Style	Metric	1	3	9	15	1	3	9	15	1	3	9	15
2,6,10,20,30	$\uparrow$	MAE	0.00098	0.00155	0.00548	0.01008	0.00099	0.00106	0.00385	0.00697	54.02	33.47	79.39	127.51
	$\uparrow$	RMSE	0.00187	0.00241	0.00887	0.01642	0.00131	0.00183	0.00656	0.01197	81.92	110.68	184.54	248.16
	$\downarrow$	MAE	0.00048	0.00153	0.00546	0.00966	0.00032	0.00105	0.00381	0.00678	13.34	32.65	78.57	123.97
	$\downarrow$	RMSE	0.00087	0.00233	0.00885	0.01577	0.00064	0.00175	0.00652	0.01174	129.65	121.78	174.63	244.34
3,4,6,20,40	$\uparrow$	MAE	0.00098	0.00155	0.00556	0.00997	0.00064	0.00105	0.00383	0.00704	39.77	32.15	76.38	124.18
	$\uparrow$	RMSE	0.00188	0.00247	0.00901	0.01631	0.00131	0.00175	0.00655	0.01210	64.86	124.20	177.46	243.46
	$\downarrow$	MAE	0.00097	0.00153	0.00542	0.00963	0.00063	0.00104	0.00369	0.00670	43.46	28.96	79.27	128.13
	$\downarrow$	RMSE	0.00187	0.00245	0.00879	0.01582	0.00130	0.00174	0.00631	0.01167	64.50	115.76	176.96	251.72
3,6,40	$\uparrow$	MAE	0.00048	0.00156	0.00536	0.00994	0.00035	0.00105	0.00370	0.00691	14.96	31.42	79.07	117.43
	$\uparrow$	RMSE	0.00087	0.00248	0.00876	0.01628	0.00065	0.00176	0.00634	0.01193	107.43	118.36	177.40	238.87
	$\downarrow$	MAE	0.00048	0.00153	0.00534	0.00988	0.00033	0.00103	0.00368	0.00685	16.81	31.26	71.96	118.66
	$\downarrow$	RMSE	0.00087	0.00244	0.00870	0.01620	0.00064	0.00173	0.00633	0.01186	145.25	114.33	175.63	236.32

B.2 Error Bar

In this paper, we repeat all the experiments five times. Here we report the standard deviation of our model and the second best model in Table 6.
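The "mean ± standard deviation" entries can be computed from repeated runs as in the minimal sketch below. The five values are hypothetical, and the use of the sample standard deviation (dividing by $n-1$) is an assumption, since the paper does not state which estimator is used.

```python
from statistics import mean, stdev


def summarize_runs(values):
    """Return (mean, sample standard deviation) across repeated runs."""
    return mean(values), stdev(values)


# Hypothetical MAE values from five repeated runs of one configuration.
runs = [0.0096, 0.0097, 0.0095, 0.0098, 0.0094]

run_mean, run_std = summarize_runs(runs)
report = f"{run_mean:.5f} ± {run_std:.2e}"
```

A small standard deviation relative to the gap between two models suggests that the ranking is unlikely to be an artifact of random initialization.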

Table 6: Error bar of our FlightPatchNet and the second best model FlightBERT++.

Model	Horizon	Lon (°)		Lat (°)		Alt (m)
Model	Horizon	MAE	RMSE	MAE	RMSE	MAE	RMSE
FlightBERT++	1	0.00173 $\pm$ 6.45e-5	0.00360 $\pm$ 8.28e-5	0.00085 $\pm$ 3.33e-5	0.00148 $\pm$ 1.32e-4	9.39 $\pm$ 1.79	175.29 $\pm$ 29.09
	3	0.00317 $\pm$ 2.61e-4	0.00659 $\pm$ 4.35e-5	0.00210 $\pm$ 3.05e-4	0.00425 $\pm$ 1.24e-4	21.89 $\pm$ 5.58	167.16 $\pm$ 46.39
	9	0.00871 $\pm$ 1.74e-4	0.01846 $\pm$ 4.45e-4	0.00612 $\pm$ 6.07e-4	0.00959 $\pm$ 2.19e-4	47.84 $\pm$ 2.87	327.93 $\pm$ 52.84
	15	0.01187 $\pm$ 5.91e-5	0.03131 $\pm$ 5.33e-4	0.01048 $\pm$ 3.69e-4	0.02127 $\pm$ 2.01e-4	78.46 $\pm$ 8.13	384.18 $\pm$ 51.82
FlightPatchNet (Ours)	1	0.00048 $\pm$ 1.24e-5	0.00087 $\pm$ 1.02e-5	0.00032 $\pm$ 1.06e-5	0.00064 $\pm$ 8.45e-6	13.34 $\pm$ 9.43	123.78 $\pm$ 15.13
	3	0.00153 $\pm$ 3.19e-5	0.00233 $\pm$ 5.44e-4	0.00105 $\pm$ 1.19e-5	0.00175 $\pm$ 2.36e-5	32.65 $\pm$ 1.76	121.48 $\pm$ 2.81
	9	0.00546 $\pm$ 1.54e-4	0.00885 $\pm$ 2.16e-4	0.00381 $\pm$ 7.47e-5	0.00652 $\pm$ 7.76e-5	78.57 $\pm$ 2.66	174.63 $\pm$ 6.87
	15	0.00966 $\pm$ 3.65e-4	0.01577 $\pm$ 5.49e-4	0.00678 $\pm$ 2.58e-4	0.01174 $\pm$ 3.53e-4	123.97 $\pm$ 5.72	244.34 $\pm$ 6.91

B.3 Model Complexity

As shown in Table 7, our proposed FlightPatchNet achieves the shortest running time and has relatively few parameters compared to other models. For multi-step prediction, the DMS-based models (FlightPatchNet, FlightBERT++) demonstrate significant improvements in computational performance compared to the IMS-based models (FlightBERT, LSTM, Bi-LSTM, CNN-LSTM). In addition, FlightPatchNet is lightweight compared to FlightBERT++ and FlightBERT, which indicates that our model can provide a reliable solution for real-time air transportation management.

Table 7: Model complexity comparisons. The look-back window size is $L=60$ and the prediction horizon is $T=15$ for all models.

Models	Parameters (MB)	FLOPs (M)	Running Time (s/iter)
FlightPatchNet(ours)	5.69	64.38	0.0069
FlightBERT++	44.26	3000	0.0112
FlightBERT	25.31	1620	0.2406
LSTM	0.03	1.67	0.0583
Bi-LSTM	0.51	31.15	0.1241
CNN-LSTM	0.04	1.22	0.0429

Appendix C Visualization

Visualization of FlightPatchNet Predictions

Figure 7 shows that FlightPatchNet can comprehensively capture the temporal variations of longitude and latitude, while it fails to fully reveal the temporal patterns from original altitude series.

Visualization of FlightPatchNet Predictions for Altitude

We present the visualization of FlightPatchNet predictions and ground truth for altitude in Figure 8. When the altitude series is relatively smooth and stationary with obvious global trends, FlightPatchNet can effectively capture these trends and make accurate predictions. When the series suffers from many change points caused by frequent abrupt fluctuations, as also depicted in Figure 8, FlightPatchNet tends to focus on the irregular change points during prediction, leading to a large deviation from the ground truth. As a result, FlightPatchNet struggles to capture the real temporal variations in altitude and fails to provide accurate predictions.

3D Trajectory Visualization

We visualize the flight trajectory prediction results of FlightPatchNet and all the baselines when the prediction horizon is 15. As shown in Figure 9, FlightPatchNet provides stable predictions that are the most accurate in longitude and latitude, while it suffers from slight fluctuations in altitude.

Appendix D Limitation and Future Work

FlightPatchNet has shown the most competitive performance for flight trajectory prediction. It is easy to implement and has fewer parameters, presenting a promising solution for real-time air traffic control applications. However, we also acknowledge the limitations of our work. Since the original altitude series contains many fluctuations caused by unexpected noise, our primary focus on modeling temporal variations leads to a large bias in the altitude predictions. In the future, we will further explore temporal modeling of altitude and consider graph networks to overcome the prediction bias caused by unexpected noise.