Aysin Tumay
Department of Electrical and Electronics Engineering, Bilkent University, Ankara, 06800, Turkey
Hierarchical Ensemble-Based Feature Selection for Time Series Forecasting
Abstract
We introduce a novel ensemble approach for feature selection based on hierarchical stacking, designed for settings with non-stationarity and/or a limited number of samples relative to a large number of features. Our approach exploits the co-dependency between features using a hierarchical structure. Initially, a machine learning model is trained using a subset of features, and then the output of the model is updated using other algorithms in a hierarchical manner with the remaining features to minimize the target loss. This hierarchical structure allows for flexible depth and feature selection. By exploiting feature co-dependency hierarchically, our proposed approach overcomes the limitations of traditional feature selection methods and feature importance scores. The effectiveness of the approach is demonstrated on synthetic and well-known real-life datasets, where it provides significant, scalable, and stable performance improvements over traditional methods and state-of-the-art approaches. We also provide the source code of our approach to facilitate further research and replicability of our results.
keywords:
feature selection, ensemble learning, the curse of dimensionality, hierarchical stacking, light gradient boosting machine (LightGBM), time series forecasting.
1 Introduction
We study feature selection for time series regression/prediction/forecasting tasks for settings where the number of features is large compared to the number of samples. This problem is extensively studied in the machine learning literature as it relates to the infamous “curse of dimensionality” phenomenon, which suggests that machine learning models tend to struggle in cases where the number of samples is not sufficient given the number of features for effective learning from data [2, 30]. This results in over-training to obtain a model with high variance, i.e., low generalization ability [11]. This feature selection problem is even more prominent in non-stationary environments, e.g., for time series data or drifting statistics, where the trend, or the relationship between the features and the desired output changes significantly over time, making it challenging to identify relevant features.
Generally, the problem is addressed by i) considering all subsets of the features, as in wrappers [16], ii) using feature characteristics, as in filters [23], iii) embedding feature selection into the model learning process, as in Lasso Linear Regression [28] and Random Forest [6], and iv) feature extraction methods such as Principal Component Analysis (PCA) [22]. However, these methods do not effectively utilize the information in the data, e.g., they simply exploit the “dominant” features without exploring others, and/or they do not scale as the number of dimensions grows, i.e., they are computationally inefficient or not dynamic enough to adapt as the domain of the features varies with the task. These widely used methods are also univariate, considering each feature one at a time when calculating its importance. Directly evaluating all subsets of the features for given data becomes an NP-hard problem as the number of features grows, rendering such techniques, e.g., wrappers, computationally infeasible, since the number of subsets exceeds a billion once the number of features reaches thirty. Embedding the selection into modeling, e.g., selecting features based on feature importance scores, is also inadequate since, with limited data, these scores are unreliable and give vague explanations about gain- or split-based selection for tree-based models [20]. One possible solution is to use ensemble or bagging techniques, where different machine learning algorithms are trained on different subsets of the feature vectors. However, this approach also loses the co-dependency information between features. Lastly, unsupervised feature extraction techniques such as PCA do not incorporate valuable information from the underlying task, even though the original task, be it regression or classification, is supervised.
Here, we introduce a highly effective and versatile ensembling approach to this problem based on hierarchical stacking. We first train an initial machine learning model using a subset of the full feature set; a second machine learning model in the hierarchy then takes the remaining features as inputs, using only the outcome of the first model in the cost function. The hierarchy grows until the features are exhausted and/or a user-controlled hierarchy depth is reached. Therefore, this generic structure allows the depth of the hierarchy, as well as the features used in each layer, to be design parameters. By exploiting the co-dependency between features in a hierarchical manner, our approach addresses the limitations of traditional feature selection methods and provides more reliable results than feature importance scores. We build upon the substantial work in the ensemble feature selection domain and illustrate the success of our approach on synthetic and well-known real-world datasets in terms of accuracy, robustness, and scalability.
2 Related Work
Filters [8] and wrappers [16] are the prevalent traditional feature selection methods in use. While the former utilize statistical tests such as chi-square, information gain, and mutual information, wrappers recursively apply forward selection or backward elimination to the current feature subset according to an evaluation metric assigned to the model. Forward selection has a hard time finding good co-predictors, but it is faster than backward elimination and therefore scales better to larger datasets. In situations where wrappers overfit, filters are used with the knowledge of statistical tests [8]. Filters are fast to compute, easy to scale to higher dimensional datasets, and independent of the model, but they ignore the dependencies between features. Wrappers, on the other hand, interact with the model and capture the feature dependencies, but they tend to require more computational resources than filters since trying all feature subsets is tedious. Saeys et al. [25] suggest more advanced methods such as an ensemble of feature selection methods and deep learning for feature selection. Moreover, Hancer [13] proposes a wrapper-filter feature selection method using fuzzy mutual information that overcomes standard mutual information’s limitations. Our approach differs from wrappers, filters, and the methods of Saeys et al. and Hancer since we propose a multivariate solution that processes groups of features by leveraging the co-dependency within them.
Boosting methods in feature selection as suggested by Das [8] are equipped with boosted decision trees where the metric is information gain and weak learners are decision stumps. They perform well with the help of increasing the weight of each high-loss decision stump. This paper also proposes a hybrid method that uses a filter method for initial feature ranking and selection, followed by a wrapper method that evaluates the selected features using a classification model, combining the high accuracy of wrappers and time efficiency of filters [25]. Even though the hybrid model of wrappers and filters generates a more efficient model, our method is more time-efficient and utilizes the statistical importance of features as well as the prior knowledge of the side information. On top of these, the gradient boosting models are not generic for every loss function since these models require the hessian of the loss function to be nonzero. Our approach can integrate any external loss function in the middle of the system, independent of the boosting algorithm, which brings a novel solution to the problem.
The methods such as feature importance extracted by SVM [7] or Random Forest [6] algorithms and unsupervised feature extraction methods are not task-specific and might be inadequate to exploit domain knowledge in many areas. As for leveraging the ensemble approach, Saeys et al. [24] investigate the impact of four different techniques of feature selection namely Symmetric Uncertainty [19] for univariate cases, RELIEF [29] for multivariate cases, feature importance measures of Random Forest and SVMs, and finally SVM-based Recursive Feature Elimination [26]. Besides, the ensembles from a bagging point of view are made with perturbations for each technique. The classification performance and robustness are merged into an F1 measure with a custom weight which results in choosing an ensemble of Random Forest [24]. While feature importance scores of tree-based models bring insight into the dataset, they ignore the correlations between features since each feature is individually analyzed. As an improvement on the RELIEF algorithm, Škrlj et al. [33] developed an approach of ReliefE that suits high-dimensional sparse input spaces. Their solution lies in adapting manifold-based embeddings of feature and target space to multi-class classification problems. Although this method proposes a context-dependent space complexity, we enable a determined smaller complexity depending on the size of the input data and the number of layers in the iterated algorithm.
One of the commonly used feature selection methods is a tree ensemble of Random Forest to determine a threshold to slice the feature subset based on the average information gain of each node [3]. There are also threshold determination methods using data complexity measures. Seijo-Pardo et al. discuss an automatic method to determine the threshold, unlike other methods that are task-specific [27]. This method takes the weighted average of the complexity measures such as the Maximum Fisher discriminant ratio (F1), the volume of overlap region (F2), maximum feature efficiency (F3), complexity measures (CF) and the percentage of features retained. After different ranking methods from filters and embedded models are combined with the min-combination method, which selects the minimum of the relevance values coming from each ranking [4], the threshold is determined with one of the complexity measures. Then, the classification is completed with the selected subset of features. Another design given in this work applies thresholding to each ranking method. Having the thresholded sets, the rest is the same as the first design. All in all, an automatic thresholding method performs better than fixed thresholding methods [27]. Although Random Forest with a varying thresholding mechanism may work better than an ordinary model, this proposal does not solve the issue of univariate feature importance scores, which we tackle thoroughly by keeping informative features together. Another novel method, proposed by Fumagalli et al. [12], is incremental permutation feature importance (iPFI), an online variation of the batch permutation method offering two sampling strategies to calculate marginal feature distributions. However, that work lacks human-grounded experiments, which we compensate for in our simulations. Additionally, the algorithm does not consider co-dependency between feature subsets, i.e., it is univariate.
The novel approach suggested by Jenul et al. [14] parallels our aim of incorporating domain knowledge as well as the dominant data itself. They employ a user-guided ensemble feature selector which includes the likelihood approach, prior weights from expert knowledge, and side constraints as regularization. Afterward, these components are combined with an optimization rule. Although our approach echoes this method in prior weights in expert knowledge and the optimization loop, we overcome the limitations of loss functions. We employ a flexible optimization loop which extends to any possible customized loss.
Overall, the contemporary literature addresses feature selection either in a disjoint manner, i.e., in a model/domain-independent fashion, or by favoring the seemingly dominant features while failing to explore the majority of the features. Unlike previous studies, our model, for the first time in the literature, fully exploits the co-dependency between features via a novel hierarchical ensemble-based approach. This allows for adaptive, i.e., dynamic, feature selection, which is especially useful in nonstationary environments, e.g., in time series settings. The introduced architecture is generic, i.e., both the depth of the hierarchy and the base models employed are user-controlled. As shown in our simulations, the introduced model provides significant improvements in performance as well as scalability on well-known real-life competition datasets compared to traditional as well as state-of-the-art feature selection approaches. We publicly share the implementation of our algorithm for model design, comparisons, and experimental reproducibility at https://github.com/aysintumay/hierarchical-feature-selection.
3 Problem Description
All vectors in this paper are column vectors in lowercase boldface type. Matrices are denoted by uppercase boldface letters. Specifically, represents a matrix containing , i.e., the sequence of vector , and in the column for each time . denotes the element of in the row and column. Ordinary transposes of and are denoted as and , respectively. The mean and standard deviation of , i.e., the dimension of , are denoted by and , respectively. The covariance of a time series with its times delayed version is represented as . The gamma function, having the property of generalizes the concept of a factorial to real and complex numbers.
We study feature selection in time series prediction of a sequence . We observe this target sequence along with a side information sequence (or feature vectors) each of which is of size . At each time , given the past information , and for , we produce the output . Hence, in this setting, our goal is to find the relationship
where is an unknown function of time, which models . We introduce a hierarchical nonlinear ensemble model for with an efficient feature selection procedure integrated in. Throughout the training of the model, we suffer the cumulative loss
where is the number of data points and can be, for example, the squared error loss, i.e., .
To exemplify the problem and illustrate the significance of for some values, i.e., lagged target sequence among features, we give the example of wind energy production prediction [17]. This task provides information from and features generated from it, also called -related sequences, and weather conditions. Weather-based features constitute the side information, indirectly related to . As shown in the wind prediction literature review [17], the -related features are the dominant features. The predictions of the model that utilize both -related features, as well as weather information, largely follow the long-term patterns of the former, i.e., the weather-based side information is hardly utilized as illustrated in Section 5. Additionally, this model is prone to overfitting due to its high dimensionality [11]. We aim to effectively incorporate weather-based features into the model, as they can capture short-term abnormalities caused by -related features. In this sense, our approach initially shapes the long-term patterns with dominant feature vectors. Then, it fine-tunes them with the side information through cost optimization, resulting in using the side information and generating a multivariate solution. Note that this phenomenon happens in most real-life applications, where due to the dominance of certain features, the other features are hardly utilized.
Formally, at time step , we have a matrix with a length and dimensionality , i.e., the data points of form an Euclidean space of -dimensions. Considering this hypersphere, we generalize the distance between data points in the -dimensional hypersphere as , where is the radius of the hypersphere. If approaches , the volume of the hypersphere, , given by:
(1) |
increases exponentially, leading to a sparser data distribution characterized by a larger [1].
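For reference, the standard closed form for the volume of a ball in Euclidean space, together with the factorial property of the gamma function mentioned in the notation, can be written as below. The symbols $M$ for the dimension and $r$ for the radius are notation assumed here for illustration.

```latex
V_M(r) = \frac{\pi^{M/2}}{\Gamma\!\left(\frac{M}{2} + 1\right)}\, r^{M},
\qquad
\Gamma(n + 1) = n! \quad \text{for integer } n \ge 0 .
```

As a sanity check, $M = 2$ gives $\pi r^2$ and $M = 3$ gives $\tfrac{4}{3}\pi r^3$.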
We also suffer from data sparsity due to non-stationarity, which is generally caused by trends and seasonality, while conducting time series forecasting for target sequence . Such trends and seasonality reduce the relative number of unique data points, which can lead to overfitting.
To this end, we propose a hierarchical ensemble-based feature selection method for the time series forecasting task to overcome overfitting and non-stationarity. We split into non-intersecting feature subsets based on domain knowledge or using certain heuristics such as feature importance metrics as demonstrated in our experiments. After this split, we train machine learning models, in a dependent manner, that take each as input in a hierarchical order by optimizing a cost function after each model until the one. Before describing our approach, we discuss the well-known approaches to this problem for completeness in the following.
3.1 Common Approaches
There are many approaches for circumventing the curse of dimensionality such as wrapper-based, embedded, filtering, and ensemble methods. Let us denote a subset of the full feature vector as , , i.e., given features, there exist different subsets of the feature set, which may include one or more features from .
3.1.1 Wrapper-based Methods
As a greedy method, the well-established wrappers have an optimization objective over the validation loss that finds the best-performing feature subset as follows:
(2) |
subject to , where and are the target sequence and the feature subset of the validation set between , is a machine learning model trained on the feature subset with parameters , and is the loss function of the model. The algorithm can iterate through each feature subset seeking to maximize the performance of the machine learning model on the validation set. Naturally, due to computational complexity issues, wrappers are hardly used in a complete form in real-life applications in most cases.
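To make the cost of a complete wrapper search concrete, the following minimal sketch enumerates every non-empty feature subset and keeps the one with the lowest validation loss. The names `exhaustive_wrapper` and `toy_validation_loss` are hypothetical, and the toy loss only stands in for training a model on each subset and scoring it on the validation set.

```python
from itertools import combinations


def exhaustive_wrapper(feature_names, validation_loss):
    """Return (best_subset, best_loss, n_evaluated) over all non-empty subsets."""
    best_subset, best_loss, n_evaluated = None, float("inf"), 0
    for size in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            loss = validation_loss(subset)
            n_evaluated += 1
            if loss < best_loss:
                best_subset, best_loss = subset, loss
    return best_subset, best_loss, n_evaluated


def toy_validation_loss(subset):
    # Hypothetical stand-in for "train on the subset, score on validation data":
    # features "a" and "c" are useful, every other feature adds a small penalty.
    useful = {"a", "c"}
    missing = len(useful - set(subset))
    extra = len(set(subset) - useful)
    return missing * 1.0 + extra * 0.1


best_subset, best_loss, n_evaluated = exhaustive_wrapper(
    ["a", "b", "c", "d"], toy_validation_loss
)
```

With four features the loop already evaluates 2^4 - 1 = 15 subsets; with thirty features the count 2^30 - 1 exceeds one billion, which is why complete wrappers are rarely practical.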
3.1.2 Embedded Methods
As another approach in the literature, embedded methods perform feature selection as a part of the model construction process. Examples of these methods include Random Forests and Gradient Boosting Trees. Tree-based models eliminate features once or recursively (also known as Recursive Feature Elimination) based on their feature importance rankings and use the remaining ones for training. One drawback of this method is that it is univariate, considering one feature at a time while calculating the scores.
3.1.3 Filtering Methods
Another traditional method of feature selection is filtering. Unlike other methods, filtering relies on statistical measures instead of using machine learning algorithms. For instance, the score of the Pearson Correlation Coefficient [21] between and for is .
A filtering algorithm can incrementally form an optimal set in terms of correlation with feature vectors from based on maximum dependency. This is accomplished by discarding the feature vectors with the lowest correlation scores from in each iteration. On the other hand, filtering methods do not inherently incorporate domain knowledge and cannot measure nonlinear dependency since they solely rely on statistical measures. Moreover, they are univariate, calculating the score of each feature one by one.
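The iterative discarding step can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `pearson` and `filter_select` are hypothetical, the ranking uses the absolute value of the correlation, and a constant sequence is assigned a score of zero by convention.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant sequences
    return cov / math.sqrt(var_x * var_y)


def filter_select(features, target, keep):
    """Drop the feature with the lowest |correlation| until `keep` remain."""
    selected = dict(features)
    while len(selected) > keep:
        weakest = min(selected, key=lambda name: abs(pearson(selected[name], target)))
        del selected[weakest]
    return sorted(selected)


target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "linear": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with the target
    "noisy": [1.0, 3.0, 2.0, 5.0, 4.0],    # partially correlated
    "flat": [1.0, -1.0, 1.0, -1.0, 1.0],   # uncorrelated with the target
}
selected = filter_select(features, target, keep=2)
```

Each score compares a single feature with the target, so two features that are only informative jointly would both be discarded, which is the univariate limitation noted above.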
3.1.4 Ensemble-based Methods
In this method, the predictions of machine learning models, also called base learners, are directly used to determine the weight vector that ensembles the base learners, as shown in Figure 1. The version with two base learners is demonstrated in Algorithm 1. All base learners take different feature subset vectors as input. Combining the predictions of base learners, denoted as , the ensemble prediction is found as follows:
(3) |
where is the base prediction vector of machine learning models, and is the ensembling coefficient vector . With affine-constraint optimization, the loss of ensemble models is subject to :
(4) |
The optimization hyperspace is -dimensional when the component of complements the sum of the entire vector to be 1. Therefore, the subject of the minimization changes into . In this sense, conventional ensemble methods can be computationally expensive, and more significantly, since they are independently trained, they cannot exploit the co-dependency between feature subsets.
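For two base learners, the constraint leaves a single free weight, so the search reduces to one dimension. The sketch below is an illustration only: the function name `ensemble_weight_search` is hypothetical, the weight is additionally restricted to [0, 1], a grid search replaces whatever optimizer is actually used, and the mean L1 loss is chosen as the objective.

```python
def ensemble_weight_search(pred_a, pred_b, target, steps=100):
    """Grid-search w in [0, 1] for w * pred_a + (1 - w) * pred_b under mean L1 loss."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        loss = sum(
            abs(t - (w * a + (1 - w) * b))
            for a, b, t in zip(pred_a, pred_b, target)
        ) / len(target)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


target = [10.0, 12.0, 14.0]
pred_a = [12.0, 14.0, 16.0]  # base learner A overshoots by 2
pred_b = [8.0, 10.0, 12.0]   # base learner B undershoots by 2
best_w, best_loss = ensemble_weight_search(pred_a, pred_b, target)
```

Note that the single weight is shared by all time steps and the two learners never see each other's features, so any interaction between the feature subsets is left unused.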
4 Hierarchical Ensemble-Based Approach
To overcome the limitations of traditional feature selection methods, we propose a hierarchical ensemble-based approach involving distinct machine learning models organized into levels. Figure 2 illustrates two sample successive layers of the structure. Each machine learning model takes the output of the previous layer ( in Figure 2). Then, the latter model ( in Figure 2) generates the predictions of the optimized weights that scale the last output. Each model operates on a different subset of , which may include exogenous information, also called the side-information, and features derived from the past information of the target sequence , , denoted as .
While the hierarchical ordering of the models is guided by domain expertise, we propose that the first level should exclusively comprise (and features derived using them, e.g., their rolling means) referred to as -related features as shown in different time series prediction papers [32, 18]. The reason is that these past target values exhibit higher importance scores than the side information sequences, thereby dominantly influencing predictions. In fact, many machine learning-based time-series models suffer from “overfitting” to the -related features and ignore most of the features [10]. Subsequent levels can incorporate the side information sequences for . We next describe the layers in the hierarchy and the optimization procedure thereof.
4.1 Description of Layers
To explicate the layers composing the introduced architecture, Figure 2 presents a snapshot of the generic model with all the major components in action; these represent the core operations in the overall hierarchy, which are repeated in succession. In Figure 2, we observe four main components in transitioning from layer to : two machine learning models (left and middle right), a cost optimization function (middle left), and a linear superposition function (right). The leftmost machine learning model, , is fed with restricted side information sequence , which corresponds to the “feature bagging” technique for addressing the bias-variance trade-off [6], as well as with the refined predictions of the previous layer, which we elaborate on in the following. There is no restriction on what this learning model could be, nor on the loss function it aims to minimize. After the learning is finished, the predictions of the model are acquired and passed on to the cost optimization block (middle left part in Figure 2). Therein lies the novelty of our algorithm: unlike the usual boosting procedure used by, e.g., LightGBM, we do not transmit these predictions as is to the next model in the chain but instead subject them to a weighting. To minimize the final loss, the cost optimization function generates a weight sequence that scales the previous prediction sequence, . Another novelty of this method lies in its compatibility with any loss function in the cost optimization stage, which overcomes the limitations of loss functions with impractical second derivatives, e.g., the L1 loss. The details of the optimization of s are given in Section 4.2.
For context-awareness, we further feed these optimized s into another learning model, , which uses the current context for its training; we acquire the context-aware weights out of it. Then in the last superposition stage, a linear function scales the prediction of the leftmost model with s to obtain the refined predictions, which are then fed to the next block in the series. The chain of blocks continues this way until hits the user-defined hierarchy size parameter . This flow of the model is depicted in Algorithm 2.
Formally, in the transition from to layer of the algorithm, we first generate model’s predictions for each time , denoted as , where , inputting into base learner . Subsequently, we deduce the weight sequence to refine by optimizing iteratively in a loop as depicted in Algorithm 2 for:
(5) |
where for each time . As depicted in the middle left pink block in Figure 2, we determine with an optimization loop to minimize the loss at each time in layer as,
(6) |
In general, need not be differentiable. Unlike ensemble methods, we optimize with and by scaling at each time with . The cost optimization step is further elaborated in Section 4.2. In this sense, in the layer, we update the output of the layer using the features that belong to the layer. Therefore, every subset of features contributes to minimizing the final error similar to boosting or stacking.
As shown in the middle right block in Figure 2, we train the consequent model ( in Figure 2) with and to predict the ultimate weights , which are context-aware. In the green block at the end, we modify using to incorporate the side information, leveraging the learned patterns from the error of the previous layer with a linear superposition function , as represented by
(7) |
In the following section, the cost optimization step between layers 1 and 2 in Algorithm 2 is elaborated.
4.2 Cost Optimization
The cost optimization refers to the iterative approach, in which we employ from a determined range to modify in any layer . We aim to find for each time . Finally, we have the following optimization objective:
(8) |
where can be any loss subject to for each time such that if a weight is in the range the prediction is effectively downscaled, and similarly for the range , we employ upscaling to leverage robustness. This procedure aims to be flexible so that the algorithm can be scaled to any domain-specific loss function. The “Cost Optimization” section in Algorithm 2 and the leftmost pink block in Figure 2 show the structure of the cost function.
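The per-time-step search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `optimize_weights`, the grid resolution, and the symmetric range [1 - delta, 1 + delta] around 1 are assumptions made here.

```python
def optimize_weights(predictions, targets, loss, delta=0.33, steps=66):
    """For each time step, pick the weight in [1 - delta, 1 + delta]
    that minimizes loss(y, w * prediction)."""
    grid = [1 - delta + 2 * delta * i / steps for i in range(steps + 1)]
    weights = []
    for pred, y in zip(predictions, targets):
        weights.append(min(grid, key=lambda w: loss(y, w * pred)))
    return weights


def l1_loss(y, y_hat):
    # Any callable works here; no gradient or hessian is required.
    return abs(y - y_hat)


predictions = [10.0, 10.0, 10.0]
targets = [10.0, 12.0, 20.0]
weights = optimize_weights(predictions, targets, l1_loss)
refined = [w * p for w, p in zip(weights, predictions)]
```

In this toy run the first prediction is left unchanged, the second is upscaled by about 1.2, and the third is clipped at the upper limit because the target lies outside the reachable range. Since the loss is only evaluated and never differentiated, any custom loss can be plugged in.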
To understand the contribution of our cost optimization approach, we compare it with the loss structure of a powerful tree-based model, LightGBM [15]. During the training process, LightGBM iteratively updates the model by minimizing the chosen cost function. This process is typically performed using gradient-boosting techniques. Choosing the base model of LightGBM as a decision tree, the leaf split finding operation is completed with the high-level insight provided by the hessian and gradient of the loss function. The gradient is defined as
(9) |
and the hessian is defined as
(10) |
If we use the customized loss function of LightGBM, it would calculate the gradient and hessian of with respect to to determine the direction and magnitude of the updates. As an example, the negative gradient for L1 loss is given by
(11) |
which leads to the hessian being impractically 0, in a form that LightGBM cannot natively process. Therefore, the hessian of the loss function should be nonzero while working with a custom function.
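As a worked illustration of why the L1 loss is problematic for second-order boosting, the derivatives can be written out explicitly. The symbols $\ell$, $y_t$, and $\hat{y}_t$ are notation assumed here for the loss, the target, and the prediction at time $t$.

```latex
\ell(y_t, \hat{y}_t) = |y_t - \hat{y}_t|,
\qquad
-\frac{\partial \ell}{\partial \hat{y}_t} = \operatorname{sign}(y_t - \hat{y}_t),
\qquad
\frac{\partial^2 \ell}{\partial \hat{y}_t^2} = 0 \quad \text{for } y_t \neq \hat{y}_t .
```

By contrast, the squared error loss $(y_t - \hat{y}_t)^2$ has a constant second derivative of 2, so its curvature information is always available.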
In our application, we bring another approach to make it convenient to embed any custom loss into our base learners, e.g., LightGBM. In our case, we define a value, which is also a tuned hyperparameter in the range of , for to be the higher limit and be the lower limit of the chosen value. As depicted in Algorithm 2, we find the optimal in a greedy process aiming to minimize any custom loss function, directly for each time step . Therefore, the objective in (6) is employed. After this step, vector is inputted to the following model depicted as in Figure 2.
By iterating the process in Algorithm 2, we can effectively search for the best combination of features and capture the correlation between them. For the sake of improving traditional feature selection methods, the iterative process allows us to optimize models simultaneously, as the predicted output from one model is used to enhance the other model.
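A two-layer pass through the hierarchy can be sketched end to end as follows. This is a toy illustration under stated assumptions: the first-layer predictions are given as fixed numbers, the second-layer learner is a hypothetical per-category mean (`fit_weight_model`) that sees only one binary side feature, and the L1 loss with a grid over [1 - delta, 1 + delta] is used in the cost optimization. Any regression model could take the place of these stand-ins.

```python
def best_weight(pred, y, delta=0.33, steps=66):
    """Grid search for the weight in [1 - delta, 1 + delta] minimizing |y - w * pred|."""
    grid = [1 - delta + 2 * delta * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda w: abs(y - w * pred))


def fit_weight_model(side_feature, optimal_weights):
    """Toy second-layer learner: mean optimal weight per side-feature value."""
    sums, counts = {}, {}
    for s, w in zip(side_feature, optimal_weights):
        sums[s] = sums.get(s, 0.0) + w
        counts[s] = counts.get(s, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}


# Layer 1: stand-in predictions that ignore the side information.
train_pred = [10.0, 10.0, 10.0, 10.0]
train_y = [12.0, 8.0, 12.0, 8.0]
train_side = [1, 0, 1, 0]  # hypothetical binary side feature

# Cost optimization: per-step weights that would have corrected layer 1.
optimal = [best_weight(p, y) for p, y in zip(train_pred, train_y)]

# Layer 2: learn to predict those weights from the side information.
weight_model = fit_weight_model(train_side, optimal)

# Inference: scale new layer-1 predictions by the predicted weights.
test_pred, test_side = [10.0, 10.0], [1, 0]
refined = [weight_model[s] * p for s, p in zip(test_side, test_pred)]
```

Here the side feature explains exactly when the first layer under- or overshoots, so the second layer recovers weights of about 1.2 and 0.8 and the refined predictions land near 12 and 8.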
In the next section, we illustrate the performance of our hierarchical ensemble-based model on synthetic and widely known real-life datasets.
5 Simulations
In this section, we illustrate the performance of our hierarchical ensemble-based model with 2 layers, i.e., , in comparison with other models on various well-known time series datasets. Initially, we introduce the models used for comparison. Then, we provide the performance of our model.
5.1 Compared Models
The simulations include 5 models that are labeled as Wrapper, Ensemble, Hierarchical Ensemble, Embedded, Baseline LightGBM. The first compared method, Wrapper, which is described in Section 3, discards the feature that gives the least contribution to the model based on the L2 loss in each iteration. The Embedded model described in Section 3 only uses and features derived from , e.g. rolling features, namely as . The reason that -related features are employed in the simulation as input to another model is to investigate if the most important features are enough without requiring domain knowledge. The Ensemble model works based on Algorithm 1 with 2 baseline models, mixing and with which is chosen in an iterative process minimizing the L1 objective. Lastly, Baseline LightGBM model refers to the model that uses the whole . With this model, we seek to find if feature selection for the datasets is necessary at all. Moreover, we expect to observe overfitting due to the high dimensionality and non-stationarity.
The evaluation metric used in the experiments is the mean squared error (MSE). All experiments are repeated 200 times to ensure the reliability of the results. The synthetic dataset is generated 200 times with random Gaussian noise. For the real-life dataset, we randomly sample 200 of the 414 series of the hourly M4 Forecasting dataset, a widely publicized competition dataset [31]. We obtain the cumulative sum of error between and for the experiment as follows:
(12) |
Then, the average over the 200 trials is taken so as to eliminate the bias due to using a particular sequence as,
(13) |
Finally, the cumulative sum over time is taken to smooth the results as follows:
(14) |
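One plausible reading of this evaluation pipeline is sketched below: a squared error per time step and trial, an average across trials, and a running sum over time. The function name `averaged_cumulative_error` is hypothetical, and any normalization used in the original equations is not reproduced here.

```python
from itertools import accumulate


def averaged_cumulative_error(targets, predictions):
    """targets/predictions: one list per trial, all of equal length.

    Returns the running sum over time of the squared error averaged across trials.
    """
    n_trials, horizon = len(targets), len(targets[0])
    mean_error = [
        sum((targets[k][t] - predictions[k][t]) ** 2 for k in range(n_trials)) / n_trials
        for t in range(horizon)
    ]
    return list(accumulate(mean_error))


targets = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
predictions = [[1.0, 1.0, 1.0], [0.0, 2.0, 4.0]]
curve = averaged_cumulative_error(targets, predictions)
```

Averaging across trials before accumulating keeps a single unlucky sequence from dominating the curve.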
About the experiment settings, the hyperparameter in the cost optimization step depicted in Algorithm 2 is fixed to 0.33 for both experiments. Therefore, the range of is .
In the next sections, we give analysis and experiments of synthetic and real-life datasets. As we simulate with models, we denote as ; likewise as .
5.2 Synthetic Dataset
The data is generated with an autoregressive moving average (ARMA) process of order , i.e.,
(15) |
where the autoregressive part is represented by the lagged values up to 4 times, and the moving average is represented by the lagged error terms up to 5 times. The and variables control the strength of the lags [5]. In our setting, , and . The Augmented Dickey-Fuller [9] test reveals the p-value as 0.2116, showing non-stationarity. Then, the series is transformed with a min-max scaler.
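The generation step can be sketched as below. The coefficient values are hypothetical placeholders chosen only so that the code runs; they are not the values used in the experiments and make no claim of reproducing the reported stationarity test result. The helper names `simulate_arma` and `min_max_scale` are likewise assumptions.

```python
import random


def simulate_arma(n, phi, theta, sigma=1.0, seed=0):
    """Simulate y_t = sum_i phi[i] * y_{t-1-i} + e_t + sum_j theta[j] * e_{t-1-j}."""
    rng = random.Random(seed)
    y, e = [], []
    for t in range(n):
        noise = rng.gauss(0.0, sigma)
        # Lags that reach before the start of the series are treated as zero.
        ar = sum(c * y[t - 1 - i] for i, c in enumerate(phi) if t - 1 - i >= 0)
        ma = sum(c * e[t - 1 - j] for j, c in enumerate(theta) if t - 1 - j >= 0)
        y.append(ar + noise + ma)
        e.append(noise)
    return y


def min_max_scale(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical coefficients: 4 autoregressive lags and 5 moving-average lags.
phi = [0.5, 0.2, 0.1, 0.1]
theta = [0.4, 0.3, 0.2, 0.1, 0.05]
series = min_max_scale(simulate_arma(500, phi, theta))
```

Fixing the seed makes a single draw reproducible; repeating the experiment would use a different seed per trial.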
We first generated the domain knowledge-representing feature subset . For that, a binary classification feature set is designed with 26 fully informative features. The synthetically generated binary sequence has a class imbalance of . The values obtained from (15) whose indexes correspond to label 1 in are multiplied by to upscale, while the others are multiplied by to downscale. This guarantees that the series depends on the generated domain knowledge series, which can provide a substantial amount of pattern. As the final step of the generation, we added Gaussian noise of . To simulate the curse of dimensionality problem, the total number of features is set to 36, as shown in Table 1, and most of the features are informative. The average p-value is also higher than , which confirms that our experimental setting is non-stationary.
Since the generated synthetic dataset is both high-dimensional and non-stationary, it is well suited to the problem statement in Section 3.
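The label-dependent scaling described above can be illustrated as follows. This is only a sketch under stated assumptions: the positive-class rate, the upscale and downscale factors, and the noise level are hypothetical defaults chosen for the example, and the function names `make_labels` and `apply_domain_labels` are likewise illustrative.

```python
import random


def make_labels(n, positive_rate, seed=0):
    """Draw an imbalanced binary sequence (1 with probability positive_rate)."""
    rng = random.Random(seed)
    return [1 if rng.random() < positive_rate else 0 for _ in range(n)]


def apply_domain_labels(series, labels, up=1.5, down=0.5, noise_std=0.01, seed=0):
    """Upscale samples labelled 1, downscale the others, then add Gaussian noise.

    The default factors and noise level are placeholders, not the paper's values.
    """
    rng = random.Random(seed)
    return [
        value * (up if label == 1 else down) + rng.gauss(0.0, noise_std)
        for value, label in zip(series, labels)
    ]
```

Because every sample is scaled according to its label, the resulting series depends on the binary sequence by construction, which is the property the generation procedure aims for.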
Based on the performance plot in Figure 3, our proposed method outperforms the compared models, and its overall loss trend is decreasing. We observe that the -related features, given to , capture most of the patterns in the label while fine-tunes the short-term patterns, as desired. Moreover, the descending trend indicates the robustness of our method. The Wrapper method is the worst-performing model by a significant margin in terms of MSE, since it could not converge to a global optimum and instead used a suboptimal feature subset. From Table 2, the time consumption of this method is roughly 42 times that of Hierarchical Ensemble, which also verifies the time efficiency of our model. Hence, Wrapper is the least efficient of the compared methods. Additionally, the Embedded and Ensemble models are very close in loss. The Embedded model generates long-term patterns successfully with -related features; therefore, Ensemble chooses -related features over date-related features. Even though the optimal feature subset is found by Embedded, it is outperformed by Hierarchical Ensemble, since our method fully incorporates the information provided by codependent feature pairs rather than deciding whether to choose -related or domain knowledge features. Moreover, Baseline LightGBM performs the closest to the proposed approach, which indicates the level of informativeness of the features. However, our proposed approach outperforms Baseline LightGBM because we overcome overfitting by training multiple models with fewer features each, whereas the baseline overfits easily due to the large number of features.
Dataset | Sample Size | Feature Size | Average p-value |
---|---|---|---|
Synthetic | 500 | 36 | 0.2376 |
M4 Hourly Forecasting | 1001 | 24 | 0.36856 |
5.3 M4 Competition Datasets
Here, the hourly M4 dataset is used as a real-life dataset, which includes 414 different time series [31]. The M4 competition dataset does not include date-time indexes. Hence, we assign the indexes externally to extract date-related features. The sample size of the training set was reduced to 953, while the test set includes 48 samples, as shown in Table 1. The stationarity test yields an average p-value of 0.36856, which indicates non-stationarity, as shown in Table 1. We give the -related features to to be consistent with the synthetic dataset. We transformed the desired data through a min-max scaler. In this setting, we took the , , and lags of the desired data to generate the mean and standard deviation of the rolling window feature. The date-time features are included in . The date-related features are the sine and cosine vectors of the hour, day of the month, day of the week, month of the year, quarter of the year, and week of the year.
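The two kinds of features described in this paragraph can be sketched as below. This is an illustration under assumptions rather than the authors' pipeline: the start timestamp is hypothetical (the M4 series carry no date-time index), the periods used for the day of the month (31) and week of the year (53) are choices made for the example, and the rolling window length is a free parameter because the lags used in the paper are not reproduced here.

```python
import math
from datetime import datetime, timedelta
from statistics import mean, pstdev


def cyclical(value, period):
    """Map a periodic value to a (sine, cosine) pair on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def date_features(ts):
    """Sine/cosine encodings of the calendar fields of a timestamp."""
    parts = [
        ("hour", ts.hour, 24),
        ("day_of_month", ts.day - 1, 31),
        ("day_of_week", ts.weekday(), 7),
        ("month", ts.month - 1, 12),
        ("quarter", (ts.month - 1) // 3, 4),
        ("week_of_year", ts.isocalendar()[1] - 1, 53),
    ]
    feats = {}
    for name, value, period in parts:
        feats[f"{name}_sin"], feats[f"{name}_cos"] = cyclical(value, period)
    return feats


def rolling_stats(series, t, window):
    """Mean and standard deviation of the `window` values preceding index t."""
    if t < window:
        raise ValueError("not enough history for this window")
    chunk = series[t - window:t]
    return mean(chunk), pstdev(chunk)


# Hypothetical hourly index assigned externally to a 48-sample series.
start = datetime(2020, 1, 1, 0)
timestamps = [start + timedelta(hours=h) for h in range(48)]
features = [date_features(ts) for ts in timestamps]
```

Encoding each calendar field as a sine/cosine pair keeps adjacent values close to each other across the wrap-around point (for example, hour 23 and hour 0), and computing the rolling statistics only from values before index `t` avoids leaking the current target into its own features.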
As shown in Figure 4 for the M4 hourly dataset, Hierarchical Ensemble outperforms the other models. We highlight that the -related features given at a higher level of the hierarchy form the long-term patterns, while the feature subset containing domain knowledge upscales or downscales these long-term patterns. Moreover, our method is more robust: its cumulative error remains steady, whereas the errors of the other methods oscillate. Baseline LightGBM performs worse than our method because the high dimensionality causes overfitting. Although the feature size is smaller than in the synthetic experiment, Baseline LightGBM memorizes the unique patterns in the training set due to the highly informative features.
The Embedded and Ensemble methods perform close to each other since the Ensemble model generally chooses the Embedded model giving of 1 among the range of . Note that the Embedded method utilizes a more dominant feature set compared to date-related features, which indicates the inclination of Baseline LightGBM to -related features. Our proposed method overcomes this issue by fine-tuning with the less dominant features in the last level for a greater impact on . Moreover, the Wrapper method clearly converges to a non-optimal local minimum, since it performs the worst. It also shows the least robust MSE performance due to its oscillatory behavior, in contrast to our method. From Table 2, the time consumption of this method is roughly four times that of Hierarchical Ensemble, which also verifies the efficiency of our model. As our proposed method performs the best, we conclude that the date-related features modify the first-layer predictions successfully, solving overshooting or undershooting problems. Overall, the proposed method works satisfactorily on the real-life dataset, outperforming the other models.
Dataset | Wrapper | Hierarchical Ensemble |
---|---|---|
Synthetic | 393.8 | 9.3552 |
M4 Hourly Forecasting | 75.24 | 17.46 |
6 Conclusions
In this work, we proposed an ensemble feature selection method based on hierarchical stacking.
Building on traditional stacking methods, our approach leverages a hierarchical structure that fully exploits the co-dependency between features. This hierarchical stacking involves training an initial machine learning model using a subset of the features and then updating the output of the model using another machine learning algorithm that takes the remaining features, or a subset of them, to adjust the first-layer predictions while minimizing a custom loss. This hierarchical structure provides novelty by allowing for flexible depth in each layer and suitability to any loss function. We demonstrate the effectiveness of our approach on the synthetic and M4 competition datasets. Overall, the proposed hierarchical ensemble approach for feature selection offers a robust and scalable solution to the challenges posed by datasets whose feature sets are high-dimensional relative to the sample size. By effectively capturing feature co-dependency and delivering enhanced accuracy and stability, our method outperforms the traditional and state-of-the-art machine learning models it is compared against.
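To make the two-level idea concrete, the following toy sketch fits a first model on one feature and a second model on another feature to correct the first model's output. It is a simplified stand-in and not the proposed algorithm: one-dimensional least-squares lines replace the LightGBM learners, fitting the residual under squared loss replaces the custom-loss update, and all names (`fit_line`, `fit_two_level`, `mse`) are hypothetical.

```python
def fit_line(x, y):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(y, pred):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def fit_two_level(x1, x2, y):
    """Level 1 predicts y from x1; level 2 corrects the residual using x2."""
    s1, b1 = fit_line(x1, y)
    level1 = [s1 * a + b1 for a in x1]
    residual = [t - p for t, p in zip(y, level1)]
    s2, b2 = fit_line(x2, residual)

    def predict_level1(a1):
        return s1 * a1 + b1

    def predict_final(a1, a2):
        # Final output = first-level prediction plus second-level adjustment.
        return predict_level1(a1) + (s2 * a2 + b2)

    return predict_level1, predict_final


# Toy data: the target depends on both features.
x1 = [0.0, 1.0, 2.0, 3.0]
x2 = [1.0, -1.0, 1.0, -1.0]
y = [0.5, 1.5, 4.5, 5.5]

level1, final = fit_two_level(x1, x2, y)
mse_level1 = mse(y, [level1(a) for a in x1])
mse_final = mse(y, [final(a, b) for a, b in zip(x1, x2)])
```

On this toy data the second level lowers the training error of the first level, which mirrors the role the lower levels of the hierarchy play in adjusting the first-layer predictions.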
We also provide the source code of our approach to facilitate further research and replicability of our results.
Acknowledgements This work is in part supported by the Turkish Academy of Sciences Outstanding Researcher Program.
Author Contributions AT: Conceptualization, Methodology, Software, Investigation, Resources, Validation, Visualization, Writing - original draft, review.
MEA: Conceptualization, Investigation, Resources, Validation, Writing - original draft, review.
ATK: Conceptualization, Investigation, Supervision, Validation, Writing, review, editing.
SSK: Conceptualization, Investigation, Methodology, Supervision, Validation, Writing, review, editing.
Funding Not applicable.
Data availability The synthetically generated data that support the findings of this study is available at https://github.com/aysintumay/hierarchical-feature-selection. The extraction process of the synthetic dataset is explained in the paper in detail. The real-life data that support the findings of this study is openly available at https://www.kaggle.com/datasets/yogesh94/m4-forecasting-competition-dataset.
Code Availability The code of this work is publicly available at https://github.com/aysintumay/hierarchical-feature-selection.
Declarations
Competing Interests: The authors declare that they have no competing financial or non-financial interests that could have influenced the presented work in this paper.
Ethics Approval: The authors confirm that there are no ethical concerns related to the work presented in this paper.
Consent to Participate: All authors agreed on the content and explicitly consented to the submission of the paper.
Consent for Publication: All authors who participated in this study give the publisher permission to publish this work.