1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
Email: [email protected]
2 School of Biology & Basic Medical Science, Soochow University, Suzhou 215006, China
3 School of Physics and Information Technology, Shaanxi Normal University, Xi’an 710061, China

Boosting MLPs with a Coarsening Strategy for Long-Term Time Series Forecasting

Nannan Bian 1, Minhong Zhu 2, Li Chen 3, Weiran Cai (🖂) 1

Abstract

Deep learning methods have shown their strengths in long-term time series forecasting. However, they often struggle to strike a balance between expressive power and computational efficiency. Here, we propose the Coarsened Perceptron Network (CP-Net), a novel architecture that efficiently enhances the predictive capability of MLPs while maintaining linear computational complexity. Its backbone is a coarsening strategy that leverages two-stage convolution-based sampling blocks. Based purely on convolution, these blocks extract short-term semantic and contextual patterns, a capability that is relatively deficient in the global point-wise projection of the MLP layer. With its architectural simplicity and low runtime, CP-Net achieves an improvement of 4.1% in MSE over the SOTA method in our experiments on seven time series forecasting benchmarks. The model further makes effective use of the exposed information, showing a consistent improvement as the look-back window expands.

Keywords:
Time series forecasting · Coarsening strategy · Pattern extraction.

1 Introduction

Long-term multivariate time series forecasting, which predicts future changes over an extended period from a substantial amount of historical data, finds diverse applications in the real world. Examples include weather forecasting, traffic flow prediction, economic planning, and electricity demand forecasting. Given the non-linear modeling capabilities of neural networks, recent studies concentrate on capturing intricate temporal patterns by employing deep learning methods [9, 24, 21, 11] as substitutes for traditional statistical approaches [5, 1, 15] in real-world time series analyses.

As in the language processing domain, time series are composed of temporal patterns, where each time point may not only form short-term dependencies with adjacent time points, such as hourly relations in the consumption of electricity, but also jointly constitute long-term global dependencies with distant ones, such as the quarterly or yearly variations in the same case. However, in the short-term aspect, the unique nature of time series, being a collection of continuously recorded values arranged in temporal order, implies that a single time point often lacks adequate information for analysis. This sets it apart from language modeling, where each word carries a specific semantic meaning. Yet in the long-term aspect, when dealing with long input sequences, time series forecasting shares the same efficiency bottleneck as language modeling. Hence, there is a continuing pursuit of an efficient deep learning model that can comprehend both short- and long-term temporal patterns jointly while maintaining a low computational complexity.

For long-term dependencies, the success of the Transformer in sequence modeling [17, 6] has made it a good choice for capturing different scales of patterns in time series. While the correlation matrix learned by the attention mechanism can reflect the pairwise temporal dependencies between time steps, the quadratic complexity of attention computation poses a challenge for scaling the canonical Transformer to longer input time series. Autoformer [21] and FEDformer [25] mitigate this deficiency by utilizing an autocorrelation mechanism and a Fourier enhancement structure, respectively, as alternatives to canonical attention. Nevertheless, these adaptations remain inadequate for modeling time series, as tokens derived from individual time points lack sufficient semantic meaning. More recently, efforts have also been made to compensate for the short-term end in the attention mechanism. For instance, PatchTST [12] divides time series into patches with rich semantic information as input tokens, outperforming previous Transformer-based methods. However, this patching strategy does not mitigate the deficiency regarding quadratic complexity.

Another category of methods leverages Convolutional Neural Networks (CNNs) for their ability to capture short-term patterns efficiently. Confronted with the challenge of capturing long-term contextual dependencies in convolutional methods, TCN [2] stacks layers of dilated convolutions to extend the receptive field. However, it needs to seek a balance between modeling longer-term dependencies and accurately capturing short-term patterns. Recent efforts primarily focus on addressing this deficiency. MICN [18] uses down-sampled convolutions and isometric convolutions to combine local features and global correlations, capturing a holistic view of the time series. TimesNet [20] converts 1D time series into multiple 2D tensors and uses a parameter-efficient inception block [14] to capture intra- and inter-period variance simultaneously. However, constrained by the characteristics of convolution and down-sampling, these methods still lack the capability to model long-term dependencies as effectively as Transformer-based or MLP-based models do.

The simple framework built on linear or MLP-based layers provides a potential solution for capturing long-term dependencies while preserving a linear computational complexity. It traces back to DLinear, which questions the necessity of the attention mechanism and achieves comparable performance by simply adopting a linear layer in association with timescale separation [22]. This has made the use of linear or MLP-based layers a new trend that substitutes for Transformers in time series forecasting. Yet, it is notable that such well-performing structures, for all their simplicity, are not adept at capturing short-term patterns: because they operate point-wise, neither their input nor their output establishes sufficient contextual information. Therefore, an effective enhancing strategy is necessary for such a framework.

Based on the aforementioned observations, we propose the Coarsened Perceptron Network (CP-Net), a two-stage framework composed of Token Projection Blocks and Contextual Sampling Blocks. The main motivation behind our design is to employ a coarsening scheme to enhance the efficiency of the MLP layer, so that rich semantic and contextual patterns are detected in the prediction.

Centered on the functionality of the MLP, the two constituent blocks introduce enhancement prior and posterior to the global point-wise projection, respectively. On the one hand, the Token Projection Block transforms the input signal into tokens that encompass indispensable semantic information, which is vague in the point-wise MLP projection; this resembles PatchTST, which pioneered the use of tokens (patches). On the other hand, the Contextual Sampling Block, featuring a combination of dilated and equispaced convolutions, provides a down-sampling functionality that supplements contextual patterns in the output of the MLP projection. Both stages are designed in the spirit of information coarsening, so as to preserve crucial temporal correlations while filtering out potential noise. Notably, the components sandwiching the MLP are purely convolution-based, which yields an overall linear computational complexity.

The main contributions of our work are summarized as follows:

  • We propose CP-Net featuring a two-stage coarsening strategy that enhances the efficiency of the globally mapping MLP layers by extracting a wealth of local contextual patterns. The backbone architecture consists of sequentially arranged Token Projection Blocks and Contextual Sampling Blocks to extract semantic patterns and contextual temporal patterns, respectively.

  • Our enhancing components are purely convolution-based and are arranged in multiple parallel branches to integrate the temporal patterns at different scales. This architectural simplicity keeps the computation within linear complexity.

  • Experiments on seven popular multivariate long-term forecasting benchmarks show performance improvements over the state-of-the-art method, achieving a decrease of 4.1% and 3.3% on MSE and MAE, respectively, averaged over four prediction lengths.

2 Related Work

2.1 Convolution-based Forecasting Approaches

The framework of convolutional neural networks (CNNs) features a filtering kernel that is crafted to capture local features in temporal signals. While employing CNNs holds the potential to capture local patterns more precisely, challenges may arise in grasping global ones. TCN [2] is the first attempt in this direction; it employs causal convolution to avoid leakage of future information and extends the receptive field of the convolution kernels with dilated convolution to model longer-range dependencies. SCINet [10] takes a different path by utilizing convolution filters to extract features from downsampled subsequences. MICN [18] further improves the idea of downsampling, leveraging downsampling convolution and isometric convolution to extract local features and global correlations; it also introduces a multi-scale structure to capture patterns at different time scales. More recently, TimesNet [20] introduces a novel 2D convolution architecture to capture intraperiod and interperiod variations simultaneously. By converting 1D time series into a collection of 2D tensors based on multiple periods, it effectively mitigates the limitation of previous downsampling methods acting on 1D time series. However, despite considerable efforts in optimizing the capture of long-range patterns, CNN-based methods are still limited by the receptive field of their kernels, leading to incomplete awareness of distant time points.
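To illustrate how stacked dilated convolutions extend the receptive field, the following minimal sketch computes the receptive field of a stack of stride-1 dilated convolutions. It assumes one convolution per layer and a shared kernel size; the function name `receptive_field` and the example dilation schedule are hypothetical and not taken from any of the cited models.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d time steps.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


# Example schedule (assumption): dilations double at each of four layers.
dilations = [2 ** layer for layer in range(4)]  # [1, 2, 4, 8]
rf = receptive_field(kernel_size=3, dilations=dilations)
print(rf)
```

With exponentially growing dilations, the receptive field grows exponentially with depth, yet it stays bounded by the depth and kernel size, which is the limitation noted above.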

2.2 Attention-based Forecasting Approaches

Transformers [17] have shown great success in NLP [6] and CV [7]. The exceptional proficiency of Transformers in comprehending long-range dependencies has spurred researchers to investigate customized adaptations specifically designed for multivariate time series forecasting [19]. Considering the quadratic complexity of the Transformer, early attempts treat different variables at the same time step as a unified entity and design specialized attention modules for efficiency. Informer [24] uses the ProbSparse self-attention mechanism to reduce the time complexity to $O(L\log L)$, accompanied by self-attention distillation for extremely long input sequences. Autoformer [21] employs sequence decomposition to uncover trend patterns in time series and designs an autocorrelation mechanism based on series periodicity, also with a complexity of $O(L\log L)$. FEDformer [25] turns to the frequency domain and designs a Fourier frequency enhancement module with $O(L)$ complexity to further improve long-term forecasting.

Despite their merits over early CNN-based and RNN-based methods, Transformer-based approaches were later criticized for their limitations in maintaining temporal information [22] and for the lack of semantic meaning in individual time steps [12]. These challenges pose a potential hindrance to their effectiveness in capturing long-range patterns in time series data. A recent attempt, PatchTST [12], seeks to mitigate such deficiencies by using overlapping patches [4] rather than single time points as input tokens, which enhances locality and captures more comprehensive semantic information while reducing the quadratic complexity of canonical Transformers. Together with the use of channel-independent prediction to rectify the prior improper treatment, wherein different variables at the same time step were considered as a unified entity, PatchTST surpasses other Transformer-based models and achieves SOTA performance.

2.3 Linear- and MLP-based Forecasting Approaches

The recent work DLinear [22] relies solely on a simple linear model and surpasses most Transformer methods. After decomposing the time series into trend and seasonal components, DLinear employs only one linear layer for each component. This motivates further exploration into utilizing linear layers for time series forecasting to reduce computational complexity while preserving effectiveness. Prior to this, many studies had used the MLP structure for time series forecasting. N-HiTS [3] is an extension of the well-known N-BEATS [13] model for long-term time series forecasting, which addresses the volatility of the predictions and their computational complexity by incorporating hierarchical interpolation and multi-rate data sampling techniques. With their architectural simplicity, frameworks of this type have spurred a new trend in time series forecasting. As an example, LightTS [23] is a lightweight deep learning architecture whose key idea is to apply an MLP-based structure to implement interval sampling and continuous sampling strategies that reflect the dependencies of the original sequence.

3 Methods

For multivariate time series forecasting with $N$ different variables, given a historical sequence $\mathbf{X}\in\mathbb{R}^{I\times N}$ of $I$ steps, our goal is to predict the future $O$ time steps $\mathbf{Y}\in\mathbb{R}^{O\times N}$. We aim at a scalable approach for the effective comprehension of both long- and short-term temporal patterns. In this section, we introduce CP-Net, a framework that features a purely convolution-based coarsening strategy, composed of Token Projection Blocks and Contextual Sampling Blocks, which serve to restore important short-term patterns to the global point-wise projection of the MLP layer.

Figure 1: Overview of CP-Net. Two-stage coarsening strategy: Time points in the input signals are coarsened by a Token Projection Block prior to the projection of the MLP layer, so as to render a preliminary prediction. Subsequently, short-term correlations are further extracted with a Contextual Sampling Block. Multi-scale merging: The multi-branch setting decodes and fuses the output information of diverse granularities to render a compound prediction. Detailed convolutional structures: The Token Projection Block aggregates semantic information by employing a standard convolution, whereas the Contextual Sampling Block incorporates temporal dependencies and filters out volatile noise by proper down-sampling through dilated and equispaced convolutions.

3.1 Overall Structure

CP-Net sequentially employs a Token Projection Block and a Contextual Sampling Block to derive the future series representation, as illustrated in Fig. 1. Initially, the Token Projection Block realizes a global projection to construct a preliminary prediction, for which the input signals are aggregated to form coarsened tokens prior to entering the mapping function of an MLP layer. Subsequently, the Contextual Sampling Block is employed after the MLP layer to extract contextual information from this preliminary output by establishing relations among sampled points over a longer temporal range. This is realized by applying a second coarsening layer that functions as down-sampling, so as to preserve significant local information while mitigating noise.

Specifically, given the time series $\mathbf{X_{in}}$ normalized by Instance Normalization [16, 8] as input, the output representation through the two-stage process can be computed as

$$\mathbf{Y}=\operatorname{Merge}\left(\operatorname{CS}\left(\operatorname{TP}\left(\mathbf{X_{in}}\right)\right)\right), \tag{1}$$

where $\operatorname{TP}(\cdot)$ and $\operatorname{CS}(\cdot)$ denote the Token Projection Block and the Contextual Sampling Block, respectively. Various temporal patterns are ultimately fused to achieve a final prediction through a multi-scale merging approach $\operatorname{Merge}(\cdot)$, which integrates the processing paths with different choices of token lengths and sampling rates in the respective blocks. It is worth noting that we adopt the channel-independence assumption, similar to that in [12]. In the following subsections, we detail the designs of the constituent blocks.
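The instance normalization applied to the input can be sketched as follows. This is a minimal pure-Python illustration, assuming that each univariate look-back window is standardized by its own mean and standard deviation and that the statistics are restored on the output; the function names and the `eps` constant are hypothetical, and the exact variant used in [16, 8] (for example, any learnable affine parameters) is not reproduced here.

```python
import math


def instance_normalize(series, eps=1e-5):
    """Standardize one univariate look-back window with its own statistics."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var + eps)  # eps avoids division by zero (assumption)
    normalized = [(x - mean) / std for x in series]
    return normalized, mean, std


def instance_denormalize(prediction, mean, std):
    """Map a prediction back to the original scale of the window."""
    return [y * std + mean for y in prediction]


window = [10.0, 12.0, 14.0, 16.0]
normed, mu, sigma = instance_normalize(window)
restored = instance_denormalize(normed, mu, sigma)
```

Because the statistics are computed per window and per variable, the step fits naturally with the channel-independent setting described above.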

3.2 Token Projection Block

The Token Projection Block takes the normalized raw time series as input and outputs a preliminary future prediction through the MLP mapping of constructed tokens. As shown in Fig. 1, the process begins with transforming the input time series into overlapping consecutive coarse-grained tokens, which mainly serves to retain semantic information in the local patterns. An MLP then projects these coarse-grained tokens into preliminary future predictions.

Concretely, let $\mathbf{X_{in}}\in\mathbb{R}^{I\times N}$ be the input series and let $TL$ be the length of the coarse-grained tokens. The preliminary prediction can then be computed as

$$\mathbf{X_{tp}}=\operatorname{TP}\left(\mathbf{X_{in}}\right)=\operatorname{MLP}\left(\operatorname{Conv1d}\left(\mathbf{X_{in}}\right)\right), \tag{2}$$

where $\operatorname{Conv1d}(\cdot)$ denotes a standard 1D convolution layer with its kernel size corresponding to the token length $TL$, and $\operatorname{MLP}(\cdot)$ is a 2-layer perceptron realizing the global projection. $\mathbf{X_{tp}}\in\mathbb{R}^{O\times N}$ serves as the preliminary output prediction of this block.
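The Token Projection Block can be illustrated for a single variable as follows. This is a rough pure-Python sketch rather than the authors' implementation: the stride of 1, the absence of padding, the ReLU activation, the hidden width `H`, the averaging kernel, and the random weights are all assumptions chosen only to make the shapes concrete.

```python
import random


def conv1d(x, kernel, stride=1):
    """Valid 1D convolution; each output is one coarse-grained token."""
    k = len(kernel)
    return [sum(w * x[s + t] for t, w in enumerate(kernel))
            for s in range(0, len(x) - k + 1, stride)]


def linear(x, weight, bias):
    """Dense layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def token_projection(x, kernel, w1, b1, w2, b2, stride=1):
    tokens = conv1d(x, kernel, stride)      # coarsen time points into tokens
    hidden = relu(linear(tokens, w1, b1))   # first MLP layer (ReLU assumed)
    return linear(hidden, w2, b2)           # second MLP layer, length O


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
I, O, TL, H = 16, 8, 4, 12                  # toy sizes (assumptions)
n_tokens = I - TL + 1                       # overlapping tokens at stride 1
kernel = [1.0 / TL] * TL                    # kernel size equals token length
w1, b1 = rand_matrix(rng, H, n_tokens), [0.0] * H
w2, b2 = rand_matrix(rng, O, H), [0.0] * O

x_in = [float(t % 5) for t in range(I)]     # one normalized input series
x_tp = token_projection(x_in, kernel, w1, b1, w2, b2)
```

The convolution touches each time step a fixed number of times, so the cost of forming tokens grows linearly with the input length `I`.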

3.3 Contextual Sampling Block

The Contextual Sampling Block takes the preliminary representation from the Token Projection Block as input and further establishes temporal dependencies over a wider time span. Consecutive time points are down-sampled and aggregated, generating new coarse-grained representations that retain essential temporal information while effectively filtering out noise. This is realized by a dilated convolution that captures contextual temporal patterns in a periodic manner, where the dilation rate equals the sampling rate $SR$. To avoid the information loss incurred by padding with zeros or mean values, we reuse the latest time steps of the input historical series as padding elements, which are concatenated with the preliminary prediction from the previous block. The above process can be formulated as

$$\mathbf{X_{cs}}=\operatorname{DilatedConv1d}\left(\operatorname{Concat}\left(\mathbf{X_{tp}},\mathbf{X_{in}}\right)\right), \tag{3}$$

where $\mathbf{X_{cs}}\in\mathbb{R}^{(I+O)\times N}$, $\operatorname{DilatedConv1d}(\cdot)$ is a 1D dilated convolution layer, and $\operatorname{Concat}(\cdot)$ is the specialized padding strategy reusing the historical series.

Subsequently, an equispaced convolution, a type of convolution where the kernel size is equal to the stride, is employed for generating coarse-grained representations. Again with the sampling rate $SR$, the coarse-grained representation is then

$$\mathbf{X_{m}}=\operatorname{EquiConv1d}\left(\mathbf{X_{cs}}\right), \tag{4}$$

where $\operatorname{EquiConv1d}(\cdot)$ is the equispaced convolution with kernel size $SR$ and $\mathbf{X_{m}}\in\mathbb{R}^{M\times N}$, with $M=(I+O)/SR$ representing the length of the output representation.
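The two convolutions of the Contextual Sampling Block can be sketched for one variable as below. Several details are assumptions made only for illustration: the history is placed before the preliminary prediction in temporal order, the dilated convolution is causal, taps that would fall before the start of the sequence are clamped to the first element so that the length $I+O$ is kept, and the kernels use fixed averaging weights instead of learned ones.

```python
def dilated_conv1d(x, kernel, dilation):
    """Causal dilated convolution that preserves the input length.

    Taps are spaced `dilation` steps apart; indices before the start of
    the sequence are clamped to 0 (boundary handling is an assumption).
    """
    k = len(kernel)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = max(0, t - (k - 1 - j) * dilation)
            acc += w * x[idx]
        out.append(acc)
    return out


def equispaced_conv1d(x, kernel):
    """Convolution whose stride equals its kernel size (disjoint windows)."""
    k = len(kernel)
    return [sum(w * x[s + j] for j, w in enumerate(kernel))
            for s in range(0, len(x) - k + 1, k)]


I, O, SR = 12, 8, 4                          # toy sizes (assumptions)
x_in = [float(t) for t in range(I)]          # toy history
x_tp = [float(I + t) for t in range(O)]      # toy preliminary prediction
x_cat = x_in + x_tp                          # history reused as context

x_cs = dilated_conv1d(x_cat, kernel=[0.5, 0.5], dilation=SR)
x_m = equispaced_conv1d(x_cs, kernel=[1.0 / SR] * SR)
```

In this toy setting `x_cs` has length `I + O = 20` and `x_m` has length `(I + O) / SR = 5`, matching the shapes stated for $\mathbf{X_{cs}}$ and $\mathbf{X_{m}}$.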

3.4 Multi-Scale Merging

In a multi-scale setting, we parallelize the Token Projection Blocks and Contextual Sampling Blocks with varying token lengths and sampling rates to ensure that temporal patterns across different scales are properly captured. The outputs of multiple branches are fused through a convolution-based merging approach. We define $\mathbf{S}=\{(TL_i, SR_j)\}$ as the ensemble of parameters for all such branches, each containing a Token Projection Block and a Contextual Sampling Block (values are set separately for simplicity). Multiple branches corresponding to the pairs defined in $\mathbf{S}$ can be computed in parallel, generating coarse-grained representations at different scales.

Since each branch preserves a granularity of information at a different scale, the coarse-grained representations obtained from different branches are decoded independently. Specifically, separate predictors are employed to decode the coarsened representation and formulate the future prediction for each branch

$$\mathbf{Y_{m}^{i,j}}=\operatorname{Predictor}\left(\mathbf{X_{m}^{i,j}}\right), \tag{5}$$

where $\operatorname{Predictor}(\cdot)$ is a 2-layer perceptron. $\mathbf{Y_{m}^{i,j}}\in\mathbb{R}^{(I+O)\times N}$ serves as the future prediction of a given branch. We utilize a 2D convolution to blend the branch-wise predictions $\mathbf{Y_{m}^{i,j}}$, which are then truncated to the desired output length. With the given parameters $\mathbf{S}$, the final prediction is computed as

$$\mathbf{Y}=\operatorname{Truncate}\left(\underset{(i,j)\in\mathbf{S}}{\operatorname{Conv2d}}\left(\mathbf{Y_{m}^{i,j}}\right)\right), \tag{6}$$

where $\mathbf{Y}\in\mathbb{R}^{O\times N}$ represents the resulting final prediction.
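The merging step can be approximated by the following sketch for a single variable. It is a simplification under stated assumptions: the branch set is taken as the Cartesian product of the token lengths and sampling rates, the 2D convolution is reduced to one blending weight per branch (as a 1×1 kernel over the branch axis would give), and truncation keeps the last $O$ steps. The helper `merge_branches`, the uniform weights, and the toy branch outputs are hypothetical.

```python
from itertools import product


def merge_branches(branch_preds, weights, horizon):
    """Blend branch predictions with one weight per branch, then truncate."""
    length = len(branch_preds[0])
    blended = [sum(w * pred[t] for w, pred in zip(weights, branch_preds))
               for t in range(length)]
    return blended[-horizon:]                # keep the last O steps (assumption)


token_lengths = [4, 8]                       # candidate TL values (toy)
sampling_rates = [2, 4]                      # candidate SR values (toy)
S = list(product(token_lengths, sampling_rates))  # one branch per (TL, SR) pair

I, O = 6, 4
# Toy branch outputs of length I + O: branch b predicts the constant b.
branch_preds = [[float(b)] * (I + O) for b in range(len(S))]
weights = [1.0 / len(S)] * len(S)            # uniform blend (hypothetical)

y = merge_branches(branch_preds, weights, O)
```

Since the branches do not depend on one another, they can be evaluated in parallel before the blend is applied.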

4 Experiments

4.1 Multivariate Long-term Time Series Forecasting

4.1.1 Datasets

We evaluate the performance of our proposed CP-Net on seven datasets, including ETT [24] (ETTm1, ETTm2, ETTh1, ETTh2), Electricity, Traffic and Weather. These datasets have been widely used as benchmarks, and their public splits and evaluation standards are available in [21]. The statistics of these datasets are shown in Table 1.

Table 1: Statistics of seven commonly used benchmark datasets.
| Dataset | ETTm1 | ETTm2 | ETTh1 | ETTh2 | Electricity | Traffic | Weather |
|---|---|---|---|---|---|---|---|
| Variates | 7 | 7 | 7 | 7 | 321 | 862 | 21 |
| Time steps | 69680 | 69680 | 17420 | 17420 | 26304 | 17544 | 52696 |
| Granularity | 15 mins | 15 mins | Hourly | Hourly | Hourly | Hourly | 10 mins |

4.1.2 Baseline Models and Setup

We choose representative models from three categories as our baselines, including the CNN-based model TimesNet [20], the Transformer-based models PatchTST [12], FEDformer [25] and Autoformer [21], and the linear- and MLP-based models DLinear [22] and LightTS [23]. Notably, PatchTST utilizes tokens (patches) as the transformed input in a manner similar to our Token Projection Block.

Table 2: Multivariate long-term time series forecasting results with CP-Net and baseline models. We set the input length $I=96$ and the forecasting length $O\in\{96,192,336,720\}$. The lower the MSE or MAE, the better the results, with the best results highlighted in bold and the second best underlined.

| Dataset | Horizon | Ours MSE | Ours MAE | TimesNet MSE | TimesNet MAE | PatchTST MSE | PatchTST MAE | DLinear MSE | DLinear MAE | LightTS MSE | LightTS MAE | FEDformer MSE | FEDformer MAE | Autoformer MSE | Autoformer MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETTm1 | 96 | 0.321 | 0.359 | 0.338 | 0.375 | 0.339 | 0.368 | 0.345 | 0.372 | 0.374 | 0.400 | 0.379 | 0.419 | 0.505 | 0.475 |
| ETTm1 | 192 | 0.367 | 0.383 | 0.374 | 0.387 | 0.374 | 0.388 | 0.380 | 0.389 | 0.400 | 0.407 | 0.426 | 0.441 | 0.553 | 0.496 |
| ETTm1 | 336 | 0.400 | 0.406 | 0.410 | 0.411 | 0.406 | 0.405 | 0.413 | 0.413 | 0.438 | 0.438 | 0.445 | 0.459 | 0.621 | 0.537 |
| ETTm1 | 720 | 0.462 | 0.441 | 0.478 | 0.450 | 0.462 | 0.440 | 0.474 | 0.453 | 0.527 | 0.502 | 0.543 | 0.490 | 0.671 | 0.561 |
| ETTm2 | 96 | 0.176 | 0.259 | 0.187 | 0.267 | 0.176 | 0.260 | 0.193 | 0.292 | 0.209 | 0.308 | 0.203 | 0.287 | 0.255 | 0.339 |
| ETTm2 | 192 | 0.241 | 0.300 | 0.249 | 0.309 | 0.242 | 0.303 | 0.284 | 0.362 | 0.311 | 0.382 | 0.269 | 0.328 | 0.281 | 0.340 |
| ETTm2 | 336 | 0.299 | 0.337 | 0.321 | 0.351 | 0.302 | 0.342 | 0.369 | 0.427 | 0.442 | 0.466 | 0.325 | 0.366 | 0.339 | 0.372 |
| ETTm2 | 720 | 0.397 | 0.394 | 0.408 | 0.403 | 0.399 | 0.396 | 0.554 | 0.522 | 0.675 | 0.587 | 0.421 | 0.415 | 0.433 | 0.432 |
| ETTh1 | 96 | 0.380 | 0.397 | 0.384 | 0.402 | 0.410 | 0.416 | 0.386 | 0.400 | 0.424 | 0.432 | 0.376 | 0.419 | 0.449 | 0.459 |
| ETTh1 | 192 | 0.435 | 0.429 | 0.436 | 0.429 | 0.459 | 0.444 | 0.437 | 0.432 | 0.475 | 0.462 | 0.420 | 0.448 | 0.500 | 0.482 |
| ETTh1 | 336 | 0.479 | 0.450 | 0.491 | 0.469 | 0.500 | 0.465 | 0.481 | 0.459 | 0.518 | 0.488 | 0.459 | 0.465 | 0.521 | 0.496 |
| ETTh1 | 720 | 0.490 | 0.473 | 0.521 | 0.500 | 0.498 | 0.487 | 0.519 | 0.516 | 0.547 | 0.533 | 0.506 | 0.507 | 0.514 | 0.512 |
| ETTh2 | 96 | 0.291 | 0.343 | 0.340 | 0.374 | 0.302 | 0.348 | 0.333 | 0.387 | 0.397 | 0.437 | 0.358 | 0.397 | 0.346 | 0.388 |
| ETTh2 | 192 | 0.367 | 0.392 | 0.402 | 0.414 | 0.388 | 0.400 | 0.477 | 0.476 | 0.520 | 0.504 | 0.429 | 0.439 | 0.456 | 0.452 |
| ETTh2 | 336 | 0.412 | 0.426 | 0.452 | 0.452 | 0.421 | 0.431 | 0.594 | 0.541 | 0.626 | 0.559 | 0.496 | 0.487 | 0.482 | 0.486 |
| ETTh2 | 720 | 0.420 | 0.440 | 0.462 | 0.468 | 0.429 | 0.445 | 0.831 | 0.657 | 0.863 | 0.672 | 0.463 | 0.474 | 0.515 | 0.511 |
| Electricity | 96 | 0.180 | 0.264 | 0.168 | 0.272 | 0.196 | 0.285 | 0.197 | 0.282 | 0.207 | 0.307 | 0.193 | 0.308 | 0.201 | 0.317 |
| Electricity | 192 | 0.186 | 0.269 | 0.184 | 0.289 | 0.198 | 0.289 | 0.196 | 0.285 | 0.213 | 0.316 | 0.201 | 0.315 | 0.222 | 0.334 |
| Electricity | 336 | 0.202 | 0.286 | 0.198 | 0.300 | 0.214 | 0.304 | 0.209 | 0.301 | 0.230 | 0.333 | 0.214 | 0.329 | 0.231 | 0.338 |
| Electricity | 720 | 0.243 | 0.319 | 0.220 | 0.320 | 0.256 | 0.336 | 0.245 | 0.333 | 0.265 | 0.360 | 0.246 | 0.355 | 0.254 | 0.361 |
| Traffic | 96 | 0.492 | 0.325 | 0.593 | 0.321 | 0.557 | 0.365 | 0.650 | 0.396 | 0.615 | 0.391 | 0.587 | 0.366 | 0.613 | 0.388 |
| Traffic | 192 | 0.492 | 0.322 | 0.617 | 0.336 | 0.545 | 0.356 | 0.598 | 0.370 | 0.601 | 0.382 | 0.604 | 0.373 | 0.616 | 0.382 |
| Traffic | 336 | 0.506 | 0.328 | 0.629 | 0.336 | 0.555 | 0.359 | 0.605 | 0.373 | 0.613 | 0.386 | 0.621 | 0.383 | 0.622 | 0.337 |
| Traffic | 720 | 0.539 | 0.345 | 0.640 | 0.350 | 0.592 | 0.376 | 0.645 | 0.394 | 0.658 | 0.407 | 0.626 | 0.382 | 0.660 | 0.408 |
| Weather | 96 | 0.177 | 0.217 | 0.172 | 0.220 | 0.179 | 0.219 | 0.196 | 0.255 | 0.182 | 0.242 | 0.217 | 0.296 | 0.266 | 0.336 |
| Weather | 192 | 0.226 | 0.259 | 0.219 | 0.261 | 0.226 | 0.259 | 0.237 | 0.296 | 0.227 | 0.287 | 0.276 | 0.336 | 0.307 | 0.367 |
| Weather | 336 | 0.281 | 0.298 | 0.280 | 0.306 | 0.280 | 0.298 | 0.283 | 0.335 | 0.282 | 0.334 | 0.339 | 0.380 | 0.359 | 0.395 |
| Weather | 720 | 0.357 | 0.348 | 0.365 | 0.359 | 0.355 | 0.348 | 0.345 | 0.381 | 0.352 | 0.386 | 0.403 | 0.428 | 0.419 | 0.428 |

All models follow the same experimental setup, with the look-back window $I=96$ and four prediction lengths $O\in\{96,192,336,720\}$. We collect the baseline results from TimesNet [20], which uses a look-back window of $I=96$. For PatchTST, we run the officially provided code with default hyper-parameter settings, changing only the look-back window to $I=96$ (which differs from the original paper), and report the results. We employ the commonly used MSE and MAE as evaluation metrics.
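For reference, the two metrics can be written out as in the short sketch below. It uses the standard definitions of mean squared error and mean absolute error on a flat list of values; how the benchmark code averages over variables and test windows is not shown here, and the toy numbers are purely illustrative.

```python
def mse(y_true, y_pred):
    """Mean squared error over paired values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error over paired values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]   # toy ground truth
y_pred = [1.5, 2.0, 2.0, 5.0]   # toy forecast
print(mse(y_true, y_pred), mae(y_true, y_pred))  # prints 0.5625 0.625
```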

4.1.3 Main Results

For multivariate long-term time series forecasting, our model outperforms or is on par with the baselines on all benchmarks, as displayed in Table 2. In comparison with the current SOTA model, the Transformer-based PatchTST, our model with its simple architecture demonstrates superior performance, with a 4.1% reduction in MSE and a 3.3% reduction in MAE. On large datasets such as Traffic and Electricity, the proposed model consistently outperforms PatchTST in all settings. Compared with the previous best CNN-based model, TimesNet, our model achieves a 6.6% reduction in MSE and a 3.3% reduction in MAE. Notably, on the largest dataset, Traffic, our model’s MSE is 18.2% lower. Compared with the top-performing linear model, DLinear, our model achieves a significant 14.4% decrease in MSE and an 11.6% decrease in MAE.
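As a quick check on how such percentages can be derived, the sketch below computes the relative MSE reduction on Traffic against TimesNet from the values in Table 2. The averaging protocol is an assumption (errors are first averaged over the four prediction lengths, then compared); the function name `relative_reduction` is hypothetical. Under this assumption the result is consistent with the 18.2% figure quoted for Traffic.

```python
def relative_reduction(ours, baseline):
    """Percentage reduction of the mean error of `ours` relative to `baseline`."""
    mean_ours = sum(ours) / len(ours)
    mean_base = sum(baseline) / len(baseline)
    return 100.0 * (mean_base - mean_ours) / mean_base


# Traffic MSE at horizons 96, 192, 336 and 720, copied from Table 2.
cpnet_mse = [0.492, 0.492, 0.506, 0.539]
timesnet_mse = [0.593, 0.617, 0.629, 0.640]

reduction = relative_reduction(cpnet_mse, timesnet_mse)
print(f"{reduction:.1f}%")  # prints 18.2%
```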

Table 3: Ablation study by removing Token Projection module and/or Contextual Sampling module on the Electricity, Traffic and Weather datasets. The best results are highlighted in bold.
| Dataset | Horizon | CP-Net MSE | CP-Net MAE | w/o TP MSE | w/o TP MAE | w/o CS MSE | w/o CS MAE | w/o TP&CS MSE | w/o TP&CS MAE |
|---|---|---|---|---|---|---|---|---|---|
| Electricity | 96 | 0.180 | 0.264 | 0.186 | 0.270 | 0.212 | 0.289 | 0.216 | 0.293 |
| Electricity | 192 | 0.186 | 0.269 | 0.190 | 0.275 | 0.203 | 0.285 | 0.208 | 0.290 |
| Electricity | 336 | 0.202 | 0.286 | 0.207 | 0.291 | 0.217 | 0.299 | 0.222 | 0.304 |
| Electricity | 720 | 0.243 | 0.319 | 0.248 | 0.324 | 0.260 | 0.332 | 0.264 | 0.337 |
| Traffic | 96 | 0.492 | 0.325 | 0.519 | 0.344 | 0.603 | 0.399 | 0.635 | 0.412 |
| Traffic | 192 | 0.492 | 0.322 | 0.514 | 0.340 | 0.560 | 0.369 | 0.588 | 0.384 |
| Traffic | 336 | 0.506 | 0.328 | 0.527 | 0.345 | 0.573 | 0.373 | 0.598 | 0.387 |
| Traffic | 720 | 0.539 | 0.345 | 0.561 | 0.362 | 0.610 | 0.392 | 0.640 | 0.407 |
| Weather | 96 | 0.177 | 0.217 | 0.180 | 0.219 | 0.198 | 0.240 | 0.201 | 0.245 |
| Weather | 192 | 0.226 | 0.259 | 0.229 | 0.261 | 0.245 | 0.275 | 0.247 | 0.279 |
| Weather | 336 | 0.281 | 0.298 | 0.284 | 0.300 | 0.295 | 0.309 | 0.297 | 0.312 |
| Weather | 720 | 0.357 | 0.348 | 0.359 | 0.348 | 0.366 | 0.354 | 0.368 | 0.356 |

4.2 Ablation Study

To verify the effectiveness of the proposed coarsening strategy in the model backbone, we conduct an ablation study on different variants of the model. Since the main purpose is to verify the coarsening modules, we retain the MLP layer and remove the semantic coarsening and contextual coarsening modules independently. For clarity, we refer to the two coarsening modules by the blocks that contain them, TP (Token Projection) and CS (Contextual Sampling). Concretely, we test the performance of the following four variants:

  • CP-Net: represents the standard model we propose.

  • w/o TP: represents removing coarsening in the TP block.

  • w/o CS: represents removing coarsening in the CS block.

  • w/o TP&CS: represents removing both coarsening modules, leaving only the MLP.

Figure 2: Impact of the number of branches on the Electricity dataset. The horizontal axis $(N_{TL}, N_{SR})$ represents the numbers of token lengths and sampling rates, respectively (for simplicity, they are set to be identical).

We carry out the ablation study on the three largest datasets, Electricity, Traffic and Weather, and the results are shown in Table 3. Overall, the two coarsening modules together significantly improve the performance over the raw MLP model, with respective improvements of 10.3%, 17.6%, and 6.5% in MSE on the three datasets. Using either module on its own also enhances the performance of the MLP model, which demonstrates the importance of complementing the MLP with an effective extraction of short-term patterns. It is noticeable, however, that while the coarsening in the TP block contributes to the performance, the coarsening module in the CS block has a more pronounced effect. A possible reason is that this module not only captures periodic patterns through dilated and equispaced convolutions but also retains information while eliminating noise through the down-sampling process.
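The relative improvements can be recomputed from Table 3 with a few lines of arithmetic. The sketch below assumes one particular convention, namely comparing the MSE averaged over the four prediction lengths for CP-Net against the w/o TP&CS variant; the text does not state how the percentages were aggregated, so this convention is an assumption.

```python
# MSE values copied from Table 3, ordered by prediction length 96, 192, 336, 720.
TABLE3_MSE = {
    "Electricity": {
        "CP-Net": [0.180, 0.186, 0.202, 0.243],
        "w/o TP&CS": [0.216, 0.208, 0.222, 0.264],
    },
    "Traffic": {
        "CP-Net": [0.492, 0.492, 0.506, 0.539],
        "w/o TP&CS": [0.635, 0.588, 0.598, 0.640],
    },
    "Weather": {
        "CP-Net": [0.177, 0.226, 0.281, 0.357],
        "w/o TP&CS": [0.201, 0.247, 0.297, 0.368],
    },
}


def relative_reduction(full, ablated):
    """Percentage reduction of the mean MSE of `full` relative to `ablated`."""
    mean_full = sum(full) / len(full)
    mean_ablated = sum(ablated) / len(ablated)
    return 100.0 * (mean_ablated - mean_full) / mean_ablated


improvements = {
    name: relative_reduction(rows["CP-Net"], rows["w/o TP&CS"])
    for name, rows in TABLE3_MSE.items()
}
for name, pct in improvements.items():
    print(f"{name}: {pct:.1f}% lower MSE than the MLP-only variant")
```

Under this assumed convention the Traffic and Weather values match the quoted percentages to one decimal place, while the Electricity value comes out somewhat higher (about 10.9%), which may reflect a different aggregation or the use of unrounded results.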

Furthermore, we conduct experiments to verify the impact of the number of branches in the backbone on the performance, taking the Electricity dataset as an example. For simplicity, we set the number of token lengths equal to the number of sampling rates, and we use the optimal parametric combination in each case. As illustrated in Fig. 2, the forecasting performance improves noticeably as the number of branches increases. However, the improvement becomes less pronounced when the number of branches reaches 4. Therefore, to balance operational efficiency and performance, we ultimately choose 3 as the number of branches.

4.3 Model Analyses

4.3.1 Consistency with look-back windows.

Previous work [22, 12] shows that the prediction accuracy of some Transformer-based models, such as Autoformer and FEDformer, degrades as the look-back window expands. With a longer look-back window, more useful information is exposed, such as long-term dependencies that cannot be captured within shorter ones. Therefore, a good time series prediction model should be able to make more accurate predictions as the look-back window expands. To examine whether our model has this desirable ability, we train it with different look-back windows $I \in \{48, 96, 192, 336, 720\}$ and compare its performance with other state-of-the-art models. The results are shown in Fig. 3. Our model performs well across varying look-back windows, with its performance consistently improving as the look-back window expands. This further validates the model’s capability to effectively utilize both short- and long-term information. Moreover, we observe that TimesNet fails to achieve better performance when the look-back window expands, which suggests that 2D convolution may still fall short in modeling long-term dependencies. At the same time, both PatchTST and DLinear exhibit subpar performance when the look-back window is small. This implies that attention and purely linear mapping may struggle to capture short-term patterns accurately.
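To make the setup concrete, the sketch below shows how training pairs could be sliced for different look-back windows with a fixed prediction length. It is a simplified, univariate illustration and not the data loader used in the experiments; the function name `make_windows` and the toy series are assumptions.

```python
def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (input, target) pairs.

    Each input holds `lookback` consecutive values and each target holds
    the `horizon` values that immediately follow it.
    """
    pairs = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        x = series[start:start + lookback]
        y = series[start + lookback:start + lookback + horizon]
        pairs.append((x, y))
    return pairs


series = list(range(1000))  # toy data standing in for one variable
HORIZON = 96                # prediction length O, fixed as in Fig. 3

for lookback in (48, 96, 192, 336, 720):
    pairs = make_windows(series, lookback, HORIZON)
    print(f"I={lookback}: {len(pairs)} training pairs")
```

Note that a longer look-back window exposes more history per sample but leaves fewer samples from a series of fixed length.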

Figure 3: Forecasting performance (MSE) with varying look-back window widths $I \in \{48, 96, 192, 336, 720\}$ on the Traffic, Electricity and ETTm1 datasets. The prediction length is fixed at $O = 96$.

4.3.2 Training and Inference Efficiency.

Beyond its strong performance compared with the baseline models, we further demonstrate the high efficiency of the proposed model, which is attributed to its MLP-based architecture. In this regard, we compare CP-Net with the SOTA model PatchTST and focus on the training and inference time. For fair comparisons, we conduct experiments using the data loader in Autoformer [21]. The Electricity dataset consists of 321 individual time series, and we use a batch size of 8, resulting in batches of size $8 \times 321 \times I$ (with $I$ being the varying width of the look-back window). We report the inference time per batch and the training time for one epoch for CP-Net and PatchTST, respectively, as the look-back window $I$ varies from 96 to 2880, as shown in Fig. 4. Our model is significantly faster than PatchTST in both training and inference. Moreover, its training and inference times increase more slowly with the look-back window. This is in contrast with PatchTST, which exhibits a greater sensitivity to the width of the look-back window (it runs out of memory when $I \geq 2880$). It can thus be concluded that, with its architectural simplicity (based purely on MLPs and convolution layers of several types), our model achieves superior or comparable accuracy with improved computational and memory efficiency compared to models based on the attention mechanism. All experiments for this runtime comparison were conducted using a single NVIDIA RTX 3090 Ti GPU on the same machine.
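The measurement procedure can be illustrated with a small timing harness. This is only a sketch under stated assumptions: `toy_forward` is a hypothetical stand-in whose cost grows linearly with the window width (it is neither CP-Net nor PatchTST), the intermediate look-back values are chosen for illustration, and timing a real GPU model would additionally require synchronizing the device before reading the clock.

```python
import time

BATCH_SIZE = 8    # batch size stated in the text
NUM_SERIES = 321  # number of series in the Electricity dataset


def batch_elements(lookback):
    """Number of scalar inputs in one batch of shape 8 x 321 x I."""
    return BATCH_SIZE * NUM_SERIES * lookback


def best_time(fn, arg, repeats=3):
    """Best wall-clock time in seconds of fn(arg) over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


def toy_forward(window):
    # Hypothetical stand-in for a forward pass: one linear sweep over a
    # single look-back window. It is NOT CP-Net or PatchTST.
    return sum(0.5 * v for v in window)


# Look-back widths between 96 and 2880; the intermediate values are assumed.
for lookback in (96, 192, 336, 720, 1440, 2880):
    window = [0.0] * lookback
    seconds = best_time(toy_forward, window)
    print(f"I={lookback}: {batch_elements(lookback)} inputs per batch, "
          f"toy pass {seconds:.6f} s")
```

Taking the best of several repeats reduces the influence of one-off system noise on the reported time.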

Figure 4: Comparison of training and inference time on the Electricity dataset against the attention-based PatchTST, one of the state-of-the-art models. Note that PatchTST encountered GPU memory exhaustion for look-back window widths $I \geq 2880$.

5 Conclusion

We present CP-Net, a novel long-term multivariate time series forecasting model that effectively enhances the point-wise global projection of the MLP layer by incorporating characteristic short-term dependencies. At the core of CP-Net lies a multi-scale coarsening strategy composed of Token Projection Blocks and Contextual Sampling Blocks, which operate before and after the MLP, respectively. Through careful design, CP-Net restores semantic patterns and long-range contextual patterns, which are largely missing from a sole MLP layer. In comparison with typical convolution-based models, which lack the ability to model long-term dependencies, and Transformer-based models, which have a higher computational complexity, our model is able to capture both local and global correlations in the time series while maintaining a linear computational complexity. With this architectural simplicity, the extensive experimental results show that CP-Net outperforms or is on par with the SOTA models across nearly all the empirical datasets, which demonstrates the effectiveness of the proposed coarsening strategy for enhancing MLPs in modeling time series data.

References

  • [1] Ariyo, A., Adewumi, A., Ayo, C.: Stock price prediction using the ARIMA model. In: 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation. pp. 106–112. IEEE (2014)
  • [2] Bai, S., Kolter, J., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)
  • [3] Challu, C., Olivares, K.G., Oreshkin, B.N., Ramirez, F.G., Canseco, M.M., Dubrawski, A.: NHITS: Neural hierarchical interpolation for time series forecasting. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 6989–6997 (2023)
  • [4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [5] Hundman, K., Constantinou, V., Laporte, C., Colwell, I., Soderstrom, T.: Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 387–395 (2018)
  • [6] Kalyan, K., Rajasekharan, A., Sangeetha, S.: AMMUS: A survey of transformer-based pretrained models in natural language processing. arXiv preprint arXiv:2108.05542 (2021)
  • [7] Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers in vision: A survey. ACM Computing Surveys (CSUR) 54(10s), 1–41 (2022)
  • [8] Kim, T., Kim, J., Tae, Y., Park, C., Choi, J.H., Choo, J.: Reversible instance normalization for accurate time-series forecasting against distribution shift. In: International Conference on Learning Representations (2021)
  • [9] Kitaev, N., Kaiser, Ł., Levskaya, A.: Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 (2020)
  • [10] Liu, M., Zeng, A., Chen, M., Xu, Z., Lai, Q., Ma, L., Xu, Q.: SCINet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems 35, 5816–5828 (2022)
  • [11] Liu, S., Yu, H., Liao, C., Li, J., Lin, W., Liu, A.X., Dustdar, S.: Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In: International Conference on Learning Representations (2021)
  • [12] Nie, Y., Nguyen, N.H., Sinthong, P., Kalagnanam, J.: A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730 (2022)
  • [13] Oreshkin, B., Carpov, D., Chapados, N., Bengio, Y.: N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437 (2019)
  • [14] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–9 (2015)
  • [15] Taylor, S., Letham, B.: Forecasting at scale. The American Statistician 72(1), 37–45 (2018)
  • [16] Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)
  • [17] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  • [18] Wang, H., Peng, J., Huang, F., Wang, J., Chen, J., Xiao, Y.: MICN: Multi-scale local and global context modeling for long-term series forecasting. In: International Conference on Learning Representations (2022)
  • [19] Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., Sun, L.: Transformers in time series: A survey. arXiv preprint arXiv:2202.07125 (2022)
  • [20] Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., Long, M.: TimesNet: Temporal 2D-variation modeling for general time series analysis. In: International Conference on Learning Representations (2022)
  • [21] Wu, H., Xu, J., Wang, J., Long, M.: Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems 34, 22419–22430 (2021)
  • [22] Zeng, A., Chen, M., Zhang, L., Xu, Q.: Are transformers effective for time series forecasting? In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 11121–11128 (2023)
  • [23] Zhang, T., Zhang, Y., Cao, W., Bian, J., Yi, X., Zheng, S., Li, J.: Less is more: Fast multivariate time series forecasting with light sampling-oriented MLP structures. arXiv preprint arXiv:2207.01186 (2022)
  • [24] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W.: Informer: Beyond efficient transformer for long sequence time-series forecasting. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 11106–11115 (2021)
  • [25] Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., Jin, R.: FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In: International Conference on Machine Learning. pp. 27268–27286. PMLR (2022)