1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
Email: [email protected]
2 School of Biology & Basic Medical Science, Soochow University, Suzhou 215006, China
3 School of Physics and Information Technology, Shaanxi Normal University, Xi’an 710061, China

Boosting MLPs with a Coarsening Strategy for Long-Term Time Series Forecasting

Nannan Bian 1, Minhong Zhu 2, Li Chen 3, Weiran Cai (🖂) 1
Abstract

Deep learning methods have shown their strengths in long-term time series forecasting. However, they often struggle to strike a balance between expressive power and computational efficiency. Resorting to multi-layer perceptrons (MLPs) provides a compromise, yet MLPs suffer from two critical problems caused by their intrinsic point-wise mapping mode: deficient contextual dependencies and an inadequate information bottleneck. Here, we propose the Coarsened Perceptron Network (CP-Net), which features a coarsening strategy that alleviates these problems of the prototype MLPs by forming information granules in place of solitary temporal points. CP-Net primarily uses a two-stage framework for extracting semantic and contextual patterns, which preserves correlations over larger timespans and filters out volatile noise. This is further enhanced by a multi-scale setting, where patterns of diverse granularities are fused towards a comprehensive prediction. Based purely on convolutions of structural simplicity, CP-Net maintains a linear computational complexity and low runtime, while demonstrating an improvement of 4.1% over the SOTA method on seven forecasting benchmarks. Code is available at https://github.com/nannanbian/CPNet

Keywords:
time series forecasting · coarsening strategy · pattern extraction

1 Introduction

Long-term multivariate time series forecasting, i.e., the prediction of future changes over an extended period from a substantial amount of historical data, finds diverse applications in the real world. Examples include weather forecasting, traffic flow prediction, economic planning, and electricity demand forecasting. Given the non-linear modeling capabilities of neural networks, recent studies concentrate on capturing intricate temporal patterns with deep learning methods [17, 13] as substitutes for traditional statistical approaches [1, 9] in real-world time series analyses.

As in language processing, time series are composed of temporal patterns, where each time point may not only form short-term dependencies with adjacent time points, such as hourly relations in the consumption of electricity, but also jointly constitute long-term global dependencies with distant ones, such as the quarterly or yearly variations in the same case. However, in the short-term aspect, the nature of time series as a collection of continuously recorded values arranged in temporal order implies that a single time point often lacks adequate information for analysis. In the long-term aspect, when dealing with long input sequences, time series forecasting shares the same efficiency challenge as language modelling. Hence, there is a continuing pursuit of an efficient deep learning model that can comprehend short- and long-term temporal patterns jointly while maintaining a low computational complexity.

Facing this dilemma, simple frameworks built on linear or MLP-based layers provide a potential solution for capturing long-term dependencies while preserving a linear computational complexity. This line of work traces back to DLinear, which questions the necessity of the attention mechanism and achieves comparable performance by simply adopting a linear layer in association with timescale separation [15]. This has made linear and MLP-based layers a new trend that substitutes Transformers in time series forecasting. Yet, by operating in a point-wise mapping mode, these architectures generally suffer from two critical problems: they are deficient in preserving contextual dependencies and inadequate in forming an information bottleneck for filtering out redundant noise. Therefore, an effective enhancing strategy is necessary for such frameworks.

To solve the aforementioned problems, we propose the Coarsened Perceptron Network (CP-Net), a two-stage framework composed of multi-scale Token Projection Blocks and Contextual Sampling Blocks. The main motivation is to employ a coarsening scheme that forms granules of information to address the limitations of the point-wise mapping of the MLP layer. This ensures that rich semantic and contextual patterns are detected in the time series and that unwanted volatile information is adequately removed.

Centered on the functionality of the MLP, the enhancing scheme introduces enhancement both prior and posterior to the point-wise projection. It has two key advantages. On the one hand, both stages are designed in the spirit of preserving crucial temporal correlations. The Token Projection Block transfers the input signal to tokens that encompass indispensable semantic information, which is missing in the point-wise MLP projection, whereas the Contextual Sampling Block, featuring a combination of dilated and equispaced convolutions, forms a down-sampling functionality that supplements contextual patterns in the output of the MLP projection. On the other hand, the information contained in the granules helps to filter out noise that would otherwise be present in the point-wise mapping. To further preserve temporal patterns of diverse granularity [11, 8], we implement this boosting strategy in a multi-scale setting by employing different token lengths and sampling rates, whose outputs are merged towards a comprehensive final prediction. Notably, all components sandwiching the core MLP in this architecture are purely convolution-based, which gives the advantage of an overall linear computational complexity.

The main contributions of our work are summarized as follows:

  • We propose CP-Net, featuring a two-stage coarsening strategy that alleviates the intrinsic drawbacks of the point-wise mapping of MLPs and boosts their efficiency in prediction tasks. The backbone architecture consists of sequentially arranged Token Projection Blocks and Contextual Sampling Blocks that extract semantic and contextual temporal patterns, respectively, and further provide an adequate information bottleneck for noise filtering.

  • Our enhancing components are purely convolution-based and are arranged in multiple parallel branches to integrate temporal patterns of different granularity. This architectural simplicity keeps the computation within a linear complexity and a low runtime.

  • Experiments on seven popular multivariate long-term forecasting benchmarks justify the proposed strategy with improvements over the state-of-the-art method, with decreases of 4.1% in MSE and 3.3% in MAE (averaged over four prediction lengths).

2 Related Work

2.1 Convolution-based Forecasting Approaches

The framework of convolutional neural networks (CNNs) features filtering kernels crafted to capture local features in temporal signals. While CNNs can capture local patterns precisely, they may have difficulty grasping global ones. TCN [2] is the first attempt in this direction: it employs causal convolution to avoid leakage of future information and extends the receptive field of the convolution kernels with dilated convolution to model longer-range dependencies. SCINet [5] takes a different path by utilizing convolution filters to extract features from downsampled subsequences. MICN [11] further develops the idea of downsampling, leveraging downsampling convolution and isometric convolution to extract local features and global correlations; it also introduces a multi-scale structure to capture patterns at different time scales. More recently, TimesNet [12] introduces a novel 2D convolution architecture to capture intraperiod and interperiod variations simultaneously. By converting 1D time series into a collection of 2D tensors based on multiple periods, it mitigates the limitation of previous downsampling methods acting on 1D time series. However, despite considerable efforts in capturing long-range patterns, CNN-based methods are still limited by the receptive fields of their kernels, which leads to incomplete awareness of distant time points. Given their proficiency in extracting local information and their low complexity, this work takes full advantage of convolutional layers throughout the enhancing strategy.

2.2 Linear- and MLP-based Forecasting Approaches

The recent work DLinear [15] relies solely on a simple linear model and surpasses most Transformer methods. After decomposing the time series into trend and seasonal components, DLinear employs only one linear layer for each component. This motivates further exploration of linear layers for time series forecasting to reduce computational complexity while preserving effectiveness. Prior to this, there had been many studies using the MLP structure for time series forecasting. N-HiTS [3] is an extension of the well-known N-Beats [7] model for long-term time series forecasting, which addresses the volatility of the predictions and their computational complexity by incorporating hierarchical interpolation and multi-rate data sampling techniques. LightTS [16] applies an MLP-based structure with interval sampling and continuous sampling strategies to reflect the dependencies of the original sequence. FreTS [14] transforms sequences into the frequency domain and redesigns the MLP to compute the real and imaginary parts of the frequency components separately. The structural simplicity and the long-term scope of these models encourage us to also use the MLP architecture as the core projection functionality. Yet, our work differs from existing models by adopting a coarsening strategy, based on the idea of forming information granules, to solve the intrinsic shortcoming of the point-wise mapping mode of MLPs.

3 Methods

For multivariate time series forecasting with $N$ different variables, given a historical sequence $\mathbf{X}\in\mathbb{R}^{I\times N}$ of $I$ steps, our goal is to predict the future $O$ time steps $\mathbf{Y}\in\mathbb{R}^{O\times N}$. We aim at a scalable approach for the effective comprehension of both long- and short-term temporal patterns. In this section, we introduce CP-Net, a multi-scale framework that features a purely convolution-based coarsening strategy, composed of Token Projection Blocks and Contextual Sampling Blocks. These components are crucial for restoring important short-term patterns to the global point-wise projection of the MLP layer, ensuring a comprehensive understanding of the time series.

Figure 1: Overview of CP-Net. Two-stage coarsening strategy: Time points in the input signals are coarsened by a Token Projection Block prior to the projection of the MLP layer, which renders a preliminary prediction. After that, short-term correlations are further extracted with a Contextual Sampling Block. Multi-scale merging: The multi-branch setting decodes and fuses the output information of diverse granularities to render a compound prediction. Detailed convolutional structures: The Token Projection Block aggregates semantic information by employing a standard convolution, whereas the Contextual Sampling Block incorporates temporal dependencies and filters out volatile noise by proper down-sampling through dilated and equispaced convolutions.

3.1 Overall Structure

The two-stage coarsening framework employs a Token Projection Block and a Contextual Sampling Block sequentially to preserve important temporal correlations at the input and output of the MLP layer, respectively, as illustrated in Fig. 1. The Token Projection Block realizes a global projection to construct a preliminary prediction, for which the input signals are aggregated into coarsened tokens prior to the MLP layer. Here, the token length ($TL$), i.e., the number of time points aggregated into each coarse-grained token, determines the input granularity. Subsequently, the Contextual Sampling Block establishes correlations over a time span that joins the historical input with the preliminary output of the MLP. It also provides a down-sampling function that preserves significant patterns while mitigating noise. The granularity of information here depends on the sampling rate ($SR$).

To account for temporal patterns of diverse granularities in real-world scenarios, we employ a multi-scale strategy to integrate them. We run parallel branches of Token Projection Blocks and Contextual Sampling Blocks with diverse predefined pairs of $TL$ and $SR$. Specifically, for a branch with $TL_{i}$ and $SR_{j}$, given as input the time series $\mathbf{X_{in}}$ normalized by Instance Normalization [10, 4], the temporal patterns produced by the two-stage coarsening process can be computed as

$$\mathbf{X_{m}^{i,j}}=\operatorname{CS^{j}}\left(\operatorname{TP^{i}}\left(\mathbf{X_{in}}\right)\right), \tag{1}$$

where $\operatorname{TP^{i}}(\cdot)$ and $\operatorname{CS^{j}}(\cdot)$ denote the Token Projection Block with $TL=TL_{i}$ and the Contextual Sampling Block with $SR=SR_{j}$, respectively.
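The normalization step mentioned above can be sketched for one univariate look-back window as follows. This is a minimal illustration that assumes the common form of instance normalization (subtract the window mean, divide by the window standard deviation, and restore both statistics on the model output); the function names and the `eps` constant are hypothetical and not taken from the paper.

```python
import math

def instance_normalize(window, eps=1e-5):
    """Normalize one look-back window with its own mean and standard deviation."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = math.sqrt(var + eps)  # eps guards against constant windows (assumption)
    normalized = [(v - mean) / std for v in window]
    return normalized, mean, std

def instance_denormalize(values, mean, std):
    """Map model outputs back to the original scale of the window."""
    return [v * std + mean for v in values]

x = [10.0, 12.0, 14.0, 16.0]
x_in, mu, sigma = instance_normalize(x)
restored = instance_denormalize(x_in, mu, sigma)
```

Because the statistics are computed per window, each input instance is centered and scaled independently of the rest of the dataset.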

Various temporal patterns are ultimately fused to achieve a final prediction through a multi-scale merging approach. Specifically, separate predictors are used to decode the temporal patterns of different branches to achieve the desired prediction length, and the final prediction is obtained through a merging function. Each branch utilizes a pair of $TL$ and $SR$ to detect correlations at different granularity levels. The merging process can be represented as

$$\mathbf{Y}=\operatorname{Merge}\left(\mathbf{X_{m}^{i,j}}\right), \tag{2}$$

where $\operatorname{Merge}(\cdot)$ is the multi-scale merging approach mentioned above. It is worth noting that we adopt the channel-independent assumption, similar to that in [6]. In the following subsections, we detail the design of the constituent blocks.
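The branch structure of Eq. (1) can be sketched as a composition over every pair of token length and sampling rate. In this illustration the two blocks are replaced by simple stand-ins (a moving average over $TL$ points and keeping every $SR$-th point); they are hypothetical placeholders for the learned blocks, and forming the branches as the full Cartesian product of the two value lists is an assumption.

```python
from itertools import product

def run_branches(x_in, tp_blocks, cs_blocks):
    """Apply Eq. (1) to every (TL, SR) pair: X_m[i, j] = CS_j(TP_i(X_in))."""
    return {
        (tl, sr): cs(tp(x_in))
        for (tl, tp), (sr, cs) in product(tp_blocks.items(), cs_blocks.items())
    }

def make_tp(tl):
    # Stand-in for a Token Projection Block: moving average over tl points.
    return lambda x: [sum(x[t:t + tl]) / tl for t in range(len(x) - tl + 1)]

def make_cs(sr):
    # Stand-in for a Contextual Sampling Block: keep every sr-th point.
    return lambda x: x[::sr]

tp_blocks = {tl: make_tp(tl) for tl in (2, 4)}   # hypothetical token lengths
cs_blocks = {sr: make_cs(sr) for sr in (1, 2)}   # hypothetical sampling rates
patterns = run_branches(list(range(8)), tp_blocks, cs_blocks)
```

Each entry of `patterns` is independent of the others, which is what allows the branches to be evaluated in parallel.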

3.2 Token Projection Block

The Token Projection Block provides granular information as the input to the MLP. It takes the normalized raw time series as input and outputs a preliminary future prediction through the MLP mapping of the constructed tokens. As shown in Fig. 1, the process begins with transforming the input sequence into overlapping consecutive coarse-grained tokens, so as to retain the semantic information in the local patterns. An MLP then projects these coarse-grained tokens into a preliminary future prediction, which gives a first insight into the future trends of the time series.

Concretely, let $\mathbf{X_{in}}\in\mathbb{R}^{I\times N}$ be the input series and $TL$ the length of the coarse-grained tokens. The preliminary prediction $\mathbf{X_{tp}}\in\mathbb{R}^{O\times N}$ can then be computed as

$$\mathbf{X_{tp}}=\operatorname{TP}(\mathbf{X_{in}})=\operatorname{MLP}\left(\operatorname{Conv1d}\left(\mathbf{X_{in}}\right)\right), \tag{3}$$

where $\operatorname{Conv1d}(\cdot)$ denotes a standard 1D convolution layer with its kernel size equal to $TL$, and $\operatorname{MLP}(\cdot)$ is a 2-layer perceptron realizing the global projection.
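A minimal sketch of Eq. (3) for a single variable is given below, written in plain Python for readability. Several details are assumptions rather than facts from the paper: the convolution uses a single kernel with stride 1 and no padding, the hidden activation is a ReLU, and all sizes and weights are illustrative.

```python
import random

def conv1d_valid(series, kernel):
    """Slide a kernel of length TL over the series with stride 1 and no padding."""
    tl = len(kernel)
    return [
        sum(w * series[t + k] for k, w in enumerate(kernel))
        for t in range(len(series) - tl + 1)
    ]

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def token_projection(x_in, kernel, w1, b1, w2, b2):
    """Eq. (3) for one variable: tokens from a 1D convolution, then a 2-layer MLP."""
    tokens = conv1d_valid(x_in, kernel)                     # overlapping coarse-grained tokens
    hidden = [max(0.0, h) for h in dense(tokens, w1, b1)]   # ReLU is an assumption
    return dense(hidden, w2, b2)                            # preliminary prediction of length O

I, TL, HIDDEN, O = 8, 3, 5, 4                               # illustrative sizes
rng = random.Random(0)
n_tokens = I - TL + 1
kernel = [1.0 / TL] * TL                                    # averaging kernel for illustration
w1 = [[rng.uniform(-1, 1) for _ in range(n_tokens)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(O)]
b2 = [0.0] * O
x_in = [float(t) for t in range(I)]
tokens = conv1d_valid(x_in, kernel)
x_tp = token_projection(x_in, kernel, w1, b1, w2, b2)
```

Each token summarizes $TL$ consecutive points, so the MLP receives local context at every position instead of isolated values.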

3.3 Contextual Sampling Block

The Contextual Sampling Block forms information granules from the MLP output. Consecutive time points are aggregated and down-sampled, which generates new coarse-grained representations that retain essential temporal information while effectively filtering out redundant noise. As shown in Fig. 1, we first use a dilated convolution to capture contextual temporal patterns in a periodic manner, with the dilation rate set to the sampling rate $SR$ so as to better capture longer-range dependencies. In order to avoid the information loss caused by zero-padding and to maintain a closer connection with the historical sequence, we reuse the latest time steps of the input historical sequence as padding elements and concatenate them with the preliminary prediction from the previous block. This approach allows us to extract temporal patterns while preserving the historical information. The above process can be formulated as

$$\mathbf{X_{cs}}=\operatorname{DilatedConv1d}\left(\operatorname{Concat}(\mathbf{X_{tp}},\mathbf{X_{in}})\right) \tag{4}$$

where $\mathbf{X_{cs}}\in\mathbb{R}^{(I+O)\times N}$, $\operatorname{DilatedConv1d}(\cdot)$ is a 1D dilated convolution layer, and $\operatorname{Concat}(\cdot)$ is the specialized padding strategy reusing the historical series.
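The idea of padding with history instead of zeros can be sketched as follows for one variable. This is a simplified illustration: it prepends only as many historical steps as the dilated kernel reaches back, so the output has the length of the preliminary prediction, whereas the paper reports an output of length $I+O$, which suggests that the full history is kept there. The kernel weights and helper names are hypothetical.

```python
def dilated_conv1d(series, kernel, dilation):
    """Valid 1D convolution whose taps are `dilation` steps apart."""
    span = dilation * (len(kernel) - 1)
    return [
        sum(w * series[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(series) - span)
    ]

def concat_with_history(x_tp, x_in, n_pad):
    """Prepend the latest n_pad historical steps instead of zeros."""
    return x_in[-n_pad:] + x_tp if n_pad > 0 else list(x_tp)

SR = 2                                  # sampling rate, used as the dilation rate
kernel = [0.5, 0.5]                     # illustrative weights
x_in = [1.0, 2.0, 3.0, 4.0]             # historical input
x_tp = [5.0, 6.0, 7.0, 8.0]             # preliminary prediction
n_pad = SR * (len(kernel) - 1)          # steps the dilated kernel reaches back
padded = concat_with_history(x_tp, x_in, n_pad)
x_cs = dilated_conv1d(padded, kernel, dilation=SR)
```

With zero-padding, the first outputs would mix predictions with artificial zeros; here they mix predictions with the most recent observations instead.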

Subsequently, an equispaced convolution, i.e., a convolution whose kernel size is equal to its stride, is employed for down-sampling the contextual temporal patterns. The down-sampling not only helps preserve the crucial longer-range temporal relationships from the previous step but also filters out redundant information, which addresses the information bottleneck issue of prototype MLPs. Again with the sampling rate $SR$, the final contextual temporal pattern can be expressed as

$$\mathbf{X_{m}}=\operatorname{EquiConv1d}\left(\mathbf{X_{cs}}\right) \tag{5}$$

where $\operatorname{EquiConv1d}(\cdot)$ is the equispaced convolution with a kernel size $SR$ and $\mathbf{X_{m}}\in\mathbb{R}^{M\times N}$, with $M=(I+O)/SR$ representing the length of the output representation. In all, by linking far-spanned temporal points and down-sampling, the two convolution layers contribute to carving out clear key patterns and function as an adequate information bottleneck.
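The down-sampling of Eq. (5) can be sketched for one variable as below. Since the kernel size equals the stride, the windows do not overlap and the output length is $M=(I+O)/SR$. The averaging weights and the sizes are illustrative assumptions; the paper's kernel is learned.

```python
def equi_conv1d(series, kernel):
    """Convolution with stride equal to the kernel size: non-overlapping windows."""
    sr = len(kernel)
    return [
        sum(w * series[start + k] for k, w in enumerate(kernel))
        for start in range(0, len(series) - sr + 1, sr)
    ]

I, O, SR = 96, 96, 4                        # illustrative sizes
x_cs = [float(t) for t in range(I + O)]     # stand-in for the dilated-convolution output
x_m = equi_conv1d(x_cs, [1.0 / SR] * SR)    # averaging weights for illustration
M = len(x_m)
```

Each output value summarizes one block of $SR$ consecutive points, so fluctuations inside a block are compressed into a single coarse-grained value.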

3.4 Multi-Scale Merging

With a multi-scale setting, we integrate temporal patterns at all granularity levels. We parallelize the Token Projection Blocks and Contextual Sampling Blocks in multiple branches corresponding to diverse token lengths and sampling rates. The outputs are fused through a convolution-based merging approach. We define $\mathbf{S}=\{(TL_{i},SR_{j})\}$ as the ensemble of parameters for all such branches containing a Token Projection Block and a Contextual Sampling Block (values are set separately for simplicity). Multiple branches corresponding to the pairs defined in $\mathbf{S}$ can be computed in parallel, generating temporal patterns at different scales. Since information granules are preserved in different branches, they are decoded independently. Specifically, separate predictors are employed to decode the temporal patterns in each branch

$$\mathbf{Y_{m}^{i,j}}=\operatorname{Predictor}\left(\mathbf{X_{m}^{i,j}}\right), \tag{6}$$

where $\operatorname{Predictor}(\cdot)$ is a 2-layer perceptron, $\mathbf{X_{m}^{i,j}}\in\mathbb{R}^{M\times N}$ with $M=(I+O)/SR_{j}$ is the output of the Contextual Sampling Block in a given branch, and $\mathbf{Y_{m}^{i,j}}\in\mathbb{R}^{(I+O)\times N}$ represents the corresponding future prediction. We utilize a 2D convolution to blend the branch-wise predictions $\mathbf{Y_{m}^{i,j}}$ with proper weights; the blended result is truncated to the desired output length $O$ and compared with the ground truth in supervised training. With the given parameters $\mathbf{S}$, the final prediction is computed as

$$\mathbf{Y}=\operatorname{Truncate}\left(\underset{(i,j)\in\mathbf{S}}{\operatorname{Conv2d}}\left(\mathbf{Y_{m}^{i,j}}\right)\right) \tag{7}$$

where $\mathbf{Y}\in\mathbb{R}^{O\times N}$ represents the resulting final prediction.
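The merging step of Eq. (7) can be sketched for one variable as a weighted blend of the branch-wise predictions followed by truncation. Two simplifications are assumed here and are not confirmed by the paper: the 2D convolution is reduced to a weighted sum across branches (which corresponds to a 1×1 kernel over a branch axis), and the truncation keeps the last $O$ steps. The weights are illustrative.

```python
def merge_branches(branch_preds, weights, horizon):
    """Blend equally long branch predictions and keep the last `horizon` steps."""
    length = len(branch_preds[0])
    if any(len(pred) != length for pred in branch_preds):
        raise ValueError("all branch predictions must have the same length")
    blended = [
        sum(w * pred[t] for w, pred in zip(weights, branch_preds))
        for t in range(length)
    ]
    return blended[-horizon:]   # keeping the last O steps is an assumption

I, O = 4, 2                     # illustrative sizes
y_a = [1.0] * (I + O)           # prediction decoded from one branch
y_b = [3.0] * (I + O)           # prediction decoded from another branch
y = merge_branches([y_a, y_b], weights=[0.25, 0.75], horizon=O)
```

In training, the blend weights would be learned together with the rest of the network, so that branches with more informative granularities contribute more to the final prediction.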

4 Experiments

4.1 Multivariate Long-term Time Series Forecasting

4.1.1 Datasets

We evaluate the performance of our proposed CP-Net on seven datasets, including ETT [17] (ETTm1, ETTm2, ETTh1, ETTh2), Electricity, Traffic and Weather. These datasets have been widely used as benchmarks, and their public splits and evaluation standards are available in [13]. The statistics of these datasets are shown in Table 1.

Table 1: Statistics of seven commonly used benchmark datasets.
dataset      ETTm1    ETTm2    ETTh1   ETTh2   Electricity  Traffic  Weather
variates     7        7        7       7       321          862      21
time steps   69680    69680    17420   17420   26304        17544    52696
granularity  15 mins  15 mins  Hourly  Hourly  Hourly       Hourly   10 mins

4.1.2 Baseline Models and Setup

We choose representative models from three categories as our baselines, including the CNN-based model TimesNet [12], the Transformer-based models PatchTST [6], FEDformer [18], Autoformer [13], and the Linear- and MLP-based models DLinear [15] and LightTS [16].

All of the models follow the same experimental setup with the same look-back window $I=96$ and four prediction lengths $O\in\{96,192,336,720\}$. We collect the baseline results from TimesNet [12] with a look-back window $I=96$. For PatchTST, we run the officially provided code with the default hyper-parameter settings, but with a look-back window $I=96$ that differs from the one in the original paper, and report the results. We employ the commonly used MSE and MAE as the evaluation metrics.
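For reference, the two metrics can be written out as below. The sketch uses their standard definitions on flat lists of values, assuming that predictions and targets are flattened over time steps and variables; the numbers are illustrative.

```python
def mse(pred, target):
    """Mean squared error over paired values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def mae(pred, target):
    """Mean absolute error over paired values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

pred = [0.5, 1.0, 2.0]
target = [1.0, 1.0, 1.0]
example_mse = mse(pred, target)
example_mae = mae(pred, target)
```

MSE penalizes large deviations more strongly than MAE, which is why the two metrics can rank models differently on the same dataset.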

4.1.3 Main Results

For multivariate long-term time series forecasting, our model outperforms or is on par with the baselines on all benchmarks, as displayed in Table 2. In comparison with the current SOTA model, the Transformer-based PatchTST, our model with its simple architecture demonstrates superior performance with a 4.1% reduction in MSE and a 3.3% reduction in MAE. On large datasets such as Traffic and Electricity, the proposed model consistently outperforms PatchTST in all settings. Compared with the previous best CNN-based model, TimesNet, our model achieves a 6.6% reduction in MSE and a 3.3% reduction in MAE. Notably, on the largest dataset, Traffic, our model’s MSE is 18.2% lower. Compared with the top-performing linear model, DLinear, our model achieves a significant 14.4% decrease in MSE and an 11.6% decrease in MAE. This demonstrates that our coarsening strategy effectively alleviates the drawbacks of point-wise projections and thus significantly outperforms the model in the same category.
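The relative reductions quoted above are percentages of the baseline error. The sketch below shows the computation for a single entry of Table 2; how the paper aggregates over datasets and prediction lengths is not stated, so this only illustrates the per-entry formula, and the helper name is hypothetical.

```python
def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Traffic, O = 96, MSE from Table 2: CP-Net 0.492 vs. PatchTST 0.557
traffic_96 = relative_reduction(0.492, 0.557)
```

A positive value means the proposed model has the lower error; a negative value would indicate that the baseline is better for that entry.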

Table 2: Multivariate long-term time series forecasting results with CP-Net and baseline models. We set the input length $I=96$ and the forecasting length $O\in\{96,192,336,720\}$. The lower the MSE or MAE, the better the results, with the best results highlighted in bold and the second best underlined.
Models           Ours          TimesNet      PatchTST      DLinear       LightTS       FEDformer     Autoformer
Metric           MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE    MSE    MAE
ETTm1 96         0.321  0.359  0.338  0.375  0.339  0.368  0.345  0.372  0.374  0.400  0.379  0.419  0.505  0.475
192              0.367  0.383  0.374  0.387  0.374  0.388  0.380  0.389  0.400  0.407  0.426  0.441  0.553  0.496
336              0.400  0.406  0.410  0.411  0.406  0.405  0.413  0.413  0.438  0.438  0.445  0.459  0.621  0.537
720              0.462  0.441  0.478  0.450  0.462  0.440  0.474  0.453  0.527  0.502  0.543  0.490  0.671  0.561
ETTm2 96         0.176  0.259  0.187  0.267  0.176  0.260  0.193  0.292  0.209  0.308  0.203  0.287  0.255  0.339
192              0.241  0.300  0.249  0.309  0.242  0.303  0.284  0.362  0.311  0.382  0.269  0.328  0.281  0.340
336              0.299  0.337  0.321  0.351  0.302  0.342  0.369  0.427  0.442  0.466  0.325  0.366  0.339  0.372
720              0.397  0.394  0.408  0.403  0.399  0.396  0.554  0.522  0.675  0.587  0.421  0.415  0.433  0.432
ETTh1 96         0.380  0.397  0.384  0.402  0.410  0.416  0.386  0.400  0.424  0.432  0.376  0.419  0.449  0.459
192              0.435  0.429  0.436  0.429  0.459  0.444  0.437  0.432  0.475  0.462  0.420  0.448  0.500  0.482
336              0.479  0.450  0.491  0.469  0.500  0.465  0.481  0.459  0.518  0.488  0.459  0.465  0.521  0.496
720              0.490  0.473  0.521  0.500  0.498  0.487  0.519  0.516  0.547  0.533  0.506  0.507  0.514  0.512
ETTh2 96         0.291  0.343  0.340  0.374  0.302  0.348  0.333  0.387  0.397  0.437  0.358  0.397  0.346  0.388
192              0.367  0.392  0.402  0.414  0.388  0.400  0.477  0.476  0.520  0.504  0.429  0.439  0.456  0.452
336              0.412  0.426  0.452  0.452  0.421  0.431  0.594  0.541  0.626  0.559  0.496  0.487  0.482  0.486
720              0.420  0.440  0.462  0.468  0.429  0.445  0.831  0.657  0.863  0.672  0.463  0.474  0.515  0.511
Electricity 96   0.180  0.264  0.168  0.272  0.196  0.285  0.197  0.282  0.207  0.307  0.193  0.308  0.201  0.317
192              0.186  0.269  0.184  0.289  0.198  0.289  0.196  0.285  0.213  0.316  0.201  0.315  0.222  0.334
336              0.202  0.286  0.198  0.300  0.214  0.304  0.209  0.301  0.230  0.333  0.214  0.329  0.231  0.338
720              0.243  0.319  0.220  0.320  0.256  0.336  0.245  0.333  0.265  0.360  0.246  0.355  0.254  0.361
Traffic 96       0.492  0.325  0.593  0.321  0.557  0.365  0.650  0.396  0.615  0.391  0.587  0.366  0.613  0.388
192              0.492  0.322  0.617  0.336  0.545  0.356  0.598  0.370  0.601  0.382  0.604  0.373  0.616  0.382
336              0.506  0.328  0.629  0.336  0.555  0.359  0.605  0.373  0.613  0.386  0.621  0.383  0.622  0.337
720              0.539  0.345  0.640  0.350  0.592  0.376  0.645  0.394  0.658  0.407  0.626  0.382  0.660  0.408
Weather 96       0.177  0.217  0.172  0.220  0.179  0.219  0.196  0.255  0.182  0.242  0.217  0.296  0.266  0.336
192              0.226  0.259  0.219  0.261  0.226  0.259  0.237  0.296  0.227  0.287  0.276  0.336  0.307  0.367
336              0.281  0.298  0.280  0.306  0.280  0.298  0.283  0.335  0.282  0.334  0.339  0.380  0.359  0.395
720              0.357  0.348  0.365  0.359  0.355  0.348  0.345  0.381  0.352  0.386  0.403  0.428  0.419  0.428

4.2 Ablation Study

To verify the effectiveness of our proposed coarsening strategy in the model backbone, we conduct an ablation study on different variants of the model. Since the main purpose is to verify the coarsening modules, we retain the MLP layer and remove the semantic coarsening and contextual coarsening modules independently. For clarity, we refer to the two coarsening modules as TP and CS. Concretely, we test the performance of the following four variants:

  • CP-Net: the standard model we propose.

  • w/o TP: the coarsening in the TP block is removed.

  • w/o CS: the coarsening in the CS block is removed.

  • w/o TP&CS: both coarsening modules are removed, leaving only the MLP.

Table 3: Ablation study removing the Token Projection module and/or the Contextual Sampling module on the Electricity, Traffic and Weather datasets. The best results are highlighted in bold.
Models            CP-Net       w/o TP       w/o CS       w/o TP&CS
Metric            MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
Electricity  96   0.180 0.264  0.186 0.270  0.212 0.289  0.216 0.293
             192  0.186 0.269  0.190 0.275  0.203 0.285  0.208 0.290
             336  0.202 0.286  0.207 0.291  0.217 0.299  0.222 0.304
             720  0.243 0.319  0.248 0.324  0.260 0.332  0.264 0.337
Traffic      96   0.492 0.325  0.519 0.344  0.603 0.399  0.635 0.412
             192  0.492 0.322  0.514 0.340  0.560 0.369  0.588 0.384
             336  0.506 0.328  0.527 0.345  0.573 0.373  0.598 0.387
             720  0.539 0.345  0.561 0.362  0.610 0.392  0.640 0.407
Weather      96   0.177 0.217  0.180 0.219  0.198 0.240  0.201 0.245
             192  0.226 0.259  0.229 0.261  0.245 0.275  0.247 0.279
             336  0.281 0.298  0.284 0.300  0.295 0.309  0.297 0.312
             720  0.357 0.348  0.359 0.348  0.366 0.354  0.368 0.356
Figure 2: Impact of the number of branches on the Electricity dataset. The horizontal axis $(N_{TL}, N_{SR})$ represents the numbers of token lengths and sampling rates, respectively (for simplicity they are set to be identical).

We carry out the ablation study on the three largest datasets, Electricity, Traffic and Weather, and the results are shown in Table 3. Overall, the two coarsening modules together significantly improve the performance compared with the raw MLP model, with respective improvements of 10.3%, 17.6%, and 6.5% in MSE on the three datasets. Using either module on its own also enhances the performance of the MLP model, which demonstrates the importance of equipping the MLP with an effective extraction of short-term patterns. It is noticeable, however, that while the coarsening in the TP block contributes to the performance, the coarsening module in the CS block has a more pronounced effect. A possible reason is that this module not only captures clear patterns through dilated and equispaced convolutions but also forms an adequate information bottleneck that eliminates noise through the down-sampling process.
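The relative improvements over the raw MLP can be recomputed from Table 3 with a short script. The sketch below is only illustrative: it assumes that the improvement is measured on the MSE averaged over the four prediction lengths, a convention the text does not spell out, so the resulting numbers need not match the quoted figures to the last digit. The names `table3_mse` and `relative_improvement` are hypothetical.

```python
# MSE values copied from Table 3 (prediction lengths 96, 192, 336, 720).
table3_mse = {
    "Electricity": {"CP-Net": [0.180, 0.186, 0.202, 0.243],
                    "w/o TP&CS": [0.216, 0.208, 0.222, 0.264]},
    "Traffic": {"CP-Net": [0.492, 0.492, 0.506, 0.539],
                "w/o TP&CS": [0.635, 0.588, 0.598, 0.640]},
    "Weather": {"CP-Net": [0.177, 0.226, 0.281, 0.357],
                "w/o TP&CS": [0.201, 0.247, 0.297, 0.368]},
}


def relative_improvement(baseline, model):
    """Percentage reduction of the horizon-averaged MSE relative to the baseline.

    Assumption: errors are averaged over prediction lengths before comparing.
    """
    base_avg = sum(baseline) / len(baseline)
    model_avg = sum(model) / len(model)
    return 100.0 * (base_avg - model_avg) / base_avg


improvements = {
    name: relative_improvement(rows["w/o TP&CS"], rows["CP-Net"])
    for name, rows in table3_mse.items()
}
for name, pct in improvements.items():
    print(f"{name}: {pct:.1f}% lower MSE than the raw MLP")
```

Swapping `"w/o TP&CS"` for `"w/o TP"` or `"w/o CS"` (after adding those columns from Table 3) gives the contribution of each module separately.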

Furthermore, we conduct experiments to verify the impact of the number of branches in the backbone on the performance, taking the Electricity dataset as an example. For simplicity, we set the number of token lengths equal to the number of sampling rates, and we use the optimal parameter combination in each case. As illustrated in Fig. 2, the forecasting performance improves noticeably as the number of branches increases. However, the improvement becomes less pronounced once the number of branches reaches 4. Therefore, to balance operational efficiency and performance, we ultimately choose 3 as the number of branches.

4.3 Model Analyses

4.3.1 Consistency with look-back windows.

Previous work [15, 6] shows that some Transformer-based models, such as Autoformer and FEDformer, lose prediction accuracy as the look-back window expands. A longer look-back window exposes more useful information, such as long-term dependencies that cannot be captured within shorter windows. A good time series forecasting model should therefore predict more accurately as the look-back window expands. To examine whether our model has this desirable property, we train it with different look-back windows $I \in \{48, 96, 192, 336, 720\}$ and compare its performance with other state-of-the-art models. The results are shown in Fig. 3. Our model performs well across all look-back windows, and its performance improves consistently as the window expands. In contrast, TimesNet fails to improve as the look-back window expands, which suggests that the 2D-convolution idea may still fall short of modeling long-term dependencies.

Figure 3: Forecasting performance (MSE) with varying look-back window widths $I \in \{48, 96, 192, 336, 720\}$ on the Traffic, Electricity and ETTm1 datasets. The prediction length is fixed at $O = 96$.
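To make the experimental setup concrete, the following sketch shows how input and target pairs could be sliced for different look-back widths $I$ at a fixed prediction length $O = 96$. It is a minimal illustration with synthetic data, not the data loader used in the experiments; the helper `make_windows` and the array sizes are assumptions.

```python
import numpy as np


def make_windows(series, lookback, horizon):
    """Slice a (T, C) array into (input, target) pairs.

    inputs  has shape (N, lookback, C)
    targets has shape (N, horizon, C)
    with N = T - lookback - horizon + 1 sliding positions.
    """
    n = series.shape[0] - lookback - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for this look-back and horizon")
    inputs = np.stack([series[i:i + lookback] for i in range(n)])
    targets = np.stack(
        [series[i + lookback:i + lookback + horizon] for i in range(n)]
    )
    return inputs, targets


# Synthetic stand-in for a multivariate series: 2000 time steps, 3 channels.
series = np.random.default_rng(0).standard_normal((2000, 3))

for lookback in (48, 96, 192, 336, 720):
    x, y = make_windows(series, lookback, horizon=96)
    print(lookback, x.shape, y.shape)
```

A longer look-back window yields fewer training pairs from the same series but exposes more history per pair, which is the trade-off examined in Fig. 3.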

4.3.2 Training and Inference Efficiency.

Beyond its strong performance compared with the baseline models, we further demonstrate the high efficiency of the proposed model, which stems from its MLP-based architecture. To this end, we compare CP-Net with the SOTA model PatchTST in terms of training and inference time. For a fair comparison, we conduct the experiments using the data loader in Autoformer [13]. We use a batch size of 8 for the Electricity dataset with 321 individual time series, resulting in batches of size 8 × 321 × I (with I being the varying width of the look-back window). We report the inference time per batch and the training time for one epoch for CP-Net and PatchTST as the look-back window I varies from 96 to 2880, as shown in Fig. 4. Our model is significantly faster than PatchTST in both training and inference, with less fluctuation. Moreover, PatchTST runs out of memory when $I \geq 2880$. We thus conclude that, owing to its architectural simplicity, our model achieves superior or comparable accuracy with improved computational and memory efficiency compared to models based on the attention mechanism. All experiments for this runtime comparison were conducted using a single NVIDIA RTX 3090 Ti GPU on the same machine.

Figure 4: Comparison of training and inference time on the Electricity dataset against PatchTST, an attention-based state-of-the-art model. Note that PatchTST exhausted GPU memory for look-back window widths $I \geq 2880$.
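A per-batch timing measurement of this kind can be sketched as follows. This is an illustrative CPU-only harness, not the benchmarking code behind Fig. 4: the model is replaced by a hypothetical single linear map from the look-back window to the forecast horizon, and `time_per_batch` is an assumed helper name. Only the batch layout 8 × 321 × I follows the setup described above.

```python
import time

import numpy as np


def time_per_batch(forward, batch, repeats=5):
    """Median wall-clock seconds of forward(batch) over several runs."""
    forward(batch)  # warm-up run, excluded from the measurement
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        forward(batch)
        timings.append(time.perf_counter() - start)
    return float(np.median(timings))


rng = np.random.default_rng(0)
horizon = 96
results = {}
for lookback in (96, 192, 336, 720):
    # Batch layout from the text: batch size x number of series x I.
    batch = rng.standard_normal((8, 321, lookback))
    # Hypothetical stand-in model: one linear map along the time axis.
    weight = rng.standard_normal((lookback, horizon))
    results[lookback] = time_per_batch(lambda b: b @ weight, batch)

for lookback, seconds in results.items():
    print(f"I={lookback}: {seconds * 1e3:.3f} ms per batch")
```

When timing on a GPU, the device has to be synchronized before the clock is read, since kernel launches return asynchronously; the harness above omits this because it runs on the CPU.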

5 Conclusion

We present CP-Net, a novel time series forecasting model featuring an effective boosting strategy for MLPs. It employs a coarsening scheme that addresses two critical problems of the point-wise projection in the MLP layer, namely deficient contextual dependencies and an inadequate information bottleneck. At the core of CP-Net lies a multi-scale coarsening strategy composed of Token Projection Blocks and Contextual Sampling Blocks, which act before and after the MLP. By forming diverse information granules, CP-Net gains two key advantages: it grasps essential patterns in the time series and, at the same time, filters out unwanted volatile information, neither of which a sole MLP layer achieves. Compared with typical convolution-based models, which lack long-term modeling capacity, and Transformer-based models, which incur a higher computational complexity, our model captures both local and global temporal correlations while maintaining a linear computational complexity. Despite its architectural simplicity, extensive experiments show that CP-Net outperforms or is on par with the SOTA models on nearly all adopted empirical datasets, which demonstrates the effectiveness of the proposed coarsening strategy in alleviating the intrinsic problems of MLPs and enhancing their modeling capability.

References

  • [1] Ariyo, A., Adewumi, A., Ayo, C.: Stock price prediction using the ARIMA model. In: International Conference on Computer Modelling and Simulation (2014)
  • [2] Bai, S., Kolter, J., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)
  • [3] Challu, C., Olivares, K.G., Oreshkin, B.N., Ramirez, F.G., Canseco, M.M., Dubrawski, A.: N-HiTS: Neural hierarchical interpolation for time series forecasting. In: AAAI Conference on Artificial Intelligence (2023)
  • [4] Kim, T., Kim, J., Tae, Y., Park, C., Choi, J.H., Choo, J.: Reversible instance normalization for accurate time-series forecasting against distribution shift. In: International Conference on Learning Representations (2021)
  • [5] Liu, M., Zeng, A., Chen, M., Xu, Z., Lai, Q., Ma, L., Xu, Q.: SCINet: Time series modeling and forecasting with sample convolution and interaction. In: Neural Information Processing Systems (2022)
  • [6] Nie, Y., Nguyen, N.H., Sinthong, P., Kalagnanam, J.: A time series is worth 64 words: Long-term forecasting with transformers. In: International Conference on Learning Representations (2023)
  • [7] Oreshkin, B., Carpov, D., Chapados, N., Bengio, Y.: N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In: International Conference on Learning Representations (2019)
  • [8] Shabani, A., Abdi, A., Meng, L., Sylvain, T.: Scaleformer: iterative multi-scale refining transformers for time series forecasting. In: International Conference on Learning Representations (2023)
  • [9] Taylor, S., Letham, B.: Forecasting at scale. The American Statistician 72(1), 37–45 (2018)
  • [10] Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)
  • [11] Wang, H., Peng, J., Huang, F., Wang, J., Chen, J., Xiao, Y.: MICN: Multi-scale local and global context modeling for long-term series forecasting. In: International Conference on Learning Representations (2023)
  • [12] Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., Long, M.: TimesNet: Temporal 2D-variation modeling for general time series analysis. In: International Conference on Learning Representations (2023)
  • [13] Wu, H., Xu, J., Wang, J., Long, M.: Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In: Neural Information Processing Systems (2021)
  • [14] Yi, K., Zhang, Q., Fan, W., Wang, S., Wang, P., He, H., An, N., Lian, D., Cao, L., Niu, Z.: Frequency-domain MLPs are more effective learners in time series forecasting. In: Neural Information Processing Systems (2023)
  • [15] Zeng, A., Chen, M., Zhang, L., Xu, Q.: Are transformers effective for time series forecasting? In: AAAI Conference on Artificial Intelligence (2023)
  • [16] Zhang, T., Zhang, Y., Cao, W., Bian, J., Yi, X., Zheng, S., Li, J.: Less is more: Fast multivariate time series forecasting with light sampling-oriented MLP structures. arXiv preprint arXiv:2207.01186 (2022)
  • [17] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W.: Informer: Beyond efficient transformer for long sequence time-series forecasting. In: AAAI Conference on Artificial Intelligence (2021)
  • [18] Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., Jin, R.: FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In: International Conference on Machine Learning (2022)