Hierarchical Ensemble-Based Feature Selection for Time Series Forecasting

Aysin Tumay, Mustafa E. Aydin, Ali T. Koc, Suleyman S. Kozat

Department of Electrical and Electronics Engineering, Bilkent University, Ankara, 06800, Turkey

Abstract

We introduce a novel ensemble approach for feature selection based on hierarchical stacking, designed for settings with non-stationarity and/or a limited number of samples relative to a large number of features. Our approach exploits the co-dependency between features using a hierarchical structure. Initially, a machine learning model is trained using a subset of features, and then the output of the model is updated using other algorithms in a hierarchical manner with the remaining features to minimize the target loss. This hierarchical structure allows for flexible depth and feature selection. By exploiting feature co-dependency hierarchically, our proposed approach overcomes the limitations of traditional feature selection methods and feature importance scores. The effectiveness of the approach is demonstrated on synthetic and well-known real-life datasets, where it provides significant, scalable, and stable performance improvements over traditional methods and state-of-the-art approaches. We also provide the source code of our approach to facilitate further research and replicability of our results.

keywords:

feature selection, ensemble learning, the curse of dimensionality, hierarchical stacking, light gradient boosting machine (LightGBM), time series forecasting.

1 Introduction

We study feature selection for time series regression/prediction/forecasting tasks in settings where the number of features is large compared to the number of samples. This problem is extensively studied in the machine learning literature as it relates to the infamous “curse of dimensionality” phenomenon, which suggests that machine learning models tend to struggle when the number of samples is not sufficient, given the number of features, for effective learning from data [2, 30]. This leads to overfitting, i.e., a model with high variance and low generalization ability [11]. The feature selection problem is even more prominent in non-stationary environments, e.g., for time series data or drifting statistics, where the trend or the relationship between the features and the desired output changes significantly over time, making it challenging to identify relevant features.

Generally, the problem is addressed by i) considering all subsets of the features, as in wrappers [16], ii) using feature characteristics, as in filters [23], iii) embedding feature selection into the model learning process, as in Lasso Linear Regression [28] and Random Forest [6], and iv) feature extraction methods such as Principal Component Analysis (PCA) [22]. However, these methods do not effectively utilize the information in the data, e.g., they simply exploit the “dominant” features without exploring others, and/or they do not scale as the number of dimensions grows, i.e., they are computationally inefficient or not dynamic enough to adapt as the domain of the features varies with the task. These widely used methods are also univariate, considering each feature once in calculating their importance. Directly evaluating all subsets of the features for a given dataset becomes an NP-hard problem as the number of features grows, rendering such techniques, e.g., wrappers, computationally infeasible: the number of subsets doubles with every added feature and exceeds a billion at thirty features. Embedding the selection methodology into modeling, e.g., feature selection based on feature importance scores, is also inadequate since, with limited data, these scores are unreliable and give vague explanations about gain or split-based selection for tree-based models [20]. One possible solution is to use ensemble or bagging techniques, where different machine learning algorithms are trained on different subsets of the feature vectors. However, this approach also loses the co-dependency information between features. Lastly, unsupervised feature extraction techniques such as PCA do not incorporate valuable information from the underlying task, even though the original task, be it regression or classification, is supervised.

Here, we introduce an effective and versatile hierarchical stacking-based ensembling approach to this problem. We first train an initial machine learning model using a subset of the full feature set. A second machine learning model in the hierarchy then takes the remaining features as inputs and uses only the outcome of the first model in the cost function. The hierarchy grows until the features are exhausted and/or a user-controlled hierarchy depth is reached. This generic structure therefore allows the depth of the hierarchy, as well as the features used in each layer, to be design parameters. By exploiting the co-dependency between features in a hierarchical manner, our approach addresses the limitations of traditional feature selection methods and provides more reliable results than feature importance scores. We build upon the substantial work in the ensemble feature selection domain and illustrate the success of our approach on synthetic and well-known real-world datasets in terms of accuracy, robustness, and scalability.

2 Related Work

Filters [8] and wrappers [16] are the prevalent traditional feature selection methods in use. While the former utilize statistical tests such as chi-square, information gain, and mutual information, wrappers recursively apply forward selection or backward elimination to the current feature subset according to an evaluation metric assigned to the model. Forward selection has a hard time finding good co-predictors, but it is faster than backward elimination and therefore scales better to larger datasets. In situations where wrappers overfit, filters are used with the knowledge of statistical tests [8]. Filters are fast to compute, easy to scale to higher dimensional datasets, and independent of the model, but they ignore the dependencies between features. Wrappers, on the other hand, interact with the model and capture the feature dependencies, but they tend to require more computational resources than filters since trying all feature subsets is tedious. Saeys et al. [25] suggest more advanced methods such as an ensemble of feature selection methods and deep learning for feature selection. Moreover, Hancer [13] proposes a wrapper-filter feature selection method using fuzzy mutual information that overcomes the limitations of standard mutual information. Our approach differs from wrappers, filters, and the methods of Saeys et al. and Hancer since we propose a multivariate solution that processes groups of features by leveraging the co-dependency within these groups.

Boosting methods in feature selection as suggested by Das [8] are equipped with boosted decision trees where the metric is information gain and weak learners are decision stumps. They perform well with the help of increasing the weight of each high-loss decision stump. This paper also proposes a hybrid method that uses a filter method for initial feature ranking and selection, followed by a wrapper method that evaluates the selected features using a classification model, combining the high accuracy of wrappers and time efficiency of filters [25]. Even though the hybrid model of wrappers and filters generates a more efficient model, our method is more time-efficient and utilizes the statistical importance of features as well as the prior knowledge of the side information. On top of these, the gradient boosting models are not generic for every loss function since these models require the hessian of the loss function to be nonzero. Our approach can integrate any external loss function in the middle of the system, independent of the boosting algorithm, which brings a novel solution to the problem.

The methods such as feature importance extracted by SVM [7] or Random Forest [6] algorithms and unsupervised feature extraction methods are not task-specific and might be inadequate to exploit domain knowledge in many areas. As for leveraging the ensemble approach, Saeys et al. [24] investigate the impact of four different techniques of feature selection namely Symmetric Uncertainty [19] for univariate cases, RELIEF [29] for multivariate cases, feature importance measures of Random Forest and SVMs, and finally SVM-based Recursive Feature Elimination [26]. Besides, the ensembles from a bagging point of view are made with perturbations for each technique. The classification performance and robustness are merged into an F1 measure with a custom weight which results in choosing an ensemble of Random Forest [24]. While feature importance scores of tree-based models bring insight into the dataset, they ignore the correlations between features since each feature is individually analyzed. As an improvement on the RELIEF algorithm, Škrlj et al. [33] developed an approach of ReliefE that suits high-dimensional sparse input spaces. Their solution lies in adapting manifold-based embeddings of feature and target space to multi-class classification problems. Although this method proposes a context-dependent space complexity, we enable a determined smaller complexity depending on the size of the input data and the number of layers in the iterated algorithm.

One of the commonly used feature selection methods is a tree ensemble of Random Forest to determine a threshold to slice the feature subset based on the average information gain of each node [3]. There are also threshold determination methods using data complexity measures. Seijo-Pardo et al. discuss an automatic method to determine the threshold, unlike other methods that are task-specific [27]. This method takes the weighted average of complexity measures such as the Maximum Fisher discriminant ratio (F1), the volume of overlap region (F2), maximum feature efficiency (F3), complexity measures (CF), and the percentage of features retained. After different ranking methods from filters and embedded models are combined with the min-combination method, which selects the minimum of the relevance values coming from each ranking [4], the threshold is determined with one of the complexity measures. Then, the classification is completed with the selected subset of features. Another design given in this work applies thresholding to each ranking method. Given the thresholded sets, the rest is the same as the first design. All in all, an automatic thresholding method performs better than fixed thresholding methods [27]. Although Random Forest with a varying thresholding mechanism may work better than an ordinary model, the issue of univariate feature importance scores, which we tackle thoroughly by keeping informative features together, is not solved in this proposal. Another novel method, proposed by Fumagalli et al. [12], is incremental permutation feature importance (iPFI), an online variation of the batch permutation method offering two sampling strategies to calculate marginal feature distributions. However, that work lacks human-grounded experiments, which we compensate for in our simulations. Additionally, the algorithm does not consider the co-dependency between feature subsets, i.e., it is univariate.

The novel approach suggested by Jenul et al. [14] parallels our aim of incorporating domain knowledge as well as the dominant data itself. They employ a user-guided ensemble feature selector which includes the likelihood approach, prior weights from expert knowledge, and side constraints as regularization. Afterward, these components are combined with an optimization rule. Although our approach echoes this method in prior weights in expert knowledge and the optimization loop, we overcome the limitations of loss functions. We employ a flexible optimization loop which extends to any possible customized loss.

Overall, the contemporary literature addresses feature selection either in a disjoint manner, i.e., in a model/domain-independent fashion, or by favoring the seemingly dominant features while failing to explore the majority of the features. Unlike previous studies, our model, for the first time in the literature, fully exploits the co-dependency between features via a novel hierarchical ensemble-based approach. This allows for adaptive, i.e., dynamic, feature selection, which is especially useful in nonstationary environments, e.g., in time series settings. The introduced architecture is generic, i.e., both the depth of the hierarchy and the base models employed are user-controlled. As shown in our simulations, the introduced model provides significant improvements in performance as well as scalability on well-known real-life competition datasets compared to traditional as well as state-of-the-art feature selection approaches. We publicly share the implementation of our algorithm for model design, comparisons, and experimental reproducibility at https://github.com/aysintumay/hierarchical-feature-selection.

3 Problem Description

All vectors in this paper are column vectors in lowercase boldface type. Matrices are denoted by uppercase boldface letters. Specifically, $\boldsymbol{X}$ represents a matrix whose row at time $t$ is $\boldsymbol{x}_{t}^{T}$ and whose $k^{th}$ column is the sequence ${x}^{(k)}$, where ${x}^{(k)}_{t}$ is the $k^{th}$ entry of $\boldsymbol{x}_{t}$. $X_{i,j}$ denotes the element of $\boldsymbol{X}$ in the $i^{th}$ row and $j^{th}$ column. Ordinary transposes of $\boldsymbol{x}_{t}$ and $\boldsymbol{x}$ are denoted as $\boldsymbol{x}^{T}_{t}$ and $\boldsymbol{x}^{T}$, respectively. The mean and standard deviation of ${x}_{t}^{(k)}$, i.e., the $k^{th}$ dimension of $\boldsymbol{x}_{t}$, are denoted by $\bar{{x}}_{t}^{(k)}$ and $\sigma({x}_{t}^{(k)})$, respectively. The covariance of a time series $\{y_{t}\}$ with its $k$ times delayed version is represented as $\gamma_{y}(t,t-k)=\mathrm{cov}({y}_{t},{y}_{t-k})$. The gamma function, which has the property $\Gamma(N)=(N-1)!$, generalizes the concept of a factorial to real and complex numbers.

We study feature selection in time series prediction of a sequence $\{y_{t}\}$ . We observe this target sequence along with a side information sequence (or feature vectors) $\boldsymbol{x}_{t}$ each of which is of size $M$ . At each time $t$ , given the past information $\{y_{k}\}$ , and $\boldsymbol{x}_{k}$ for $k\leq t$ , we produce the output $\hat{y}_{t+1}$ . Hence, in this setting, our goal is to find the relationship

\hat{y}_{t+1}=F_{t}\big(\{y_{1},y_{2},\ldots,y_{t}\},\{\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{t}\}\big),

where $F_{t}$ is an unknown function of time, which models $\hat{y}_{t+1}$ . We introduce a hierarchical nonlinear ensemble model for $F_{t}$ with an efficient feature selection procedure integrated in. Throughout the training of the model, we suffer the cumulative loss

L=\sum_{t=1}^{N}\ell(y_{t},\hat{y}_{t}),

where $N$ is the number of data points and $\ell$ can be, for example, the squared error loss, i.e., $\ell(y_{t},\hat{y}_{t})=(y_{t}-\hat{y}_{t})^{2}$ .

To exemplify the problem and illustrate the significance of $\{y_{t-j}\}$ for some $j>0$ values, i.e., lagged target sequence among features, we give the example of wind energy production prediction [17]. This task provides information from $\{y_{t-j}\}$ and features generated from it, also called $y_{t}$ -related sequences, and weather conditions. Weather-based features constitute the side information, indirectly related to $\{y_{t}\}$ . As shown in the wind prediction literature review [17], the $y_{t}$ -related features are the dominant features. The predictions of the model that utilize both $y_{t}$ -related features, as well as weather information, largely follow the long-term patterns of the former, i.e., the weather-based side information is hardly utilized as illustrated in Section 5. Additionally, this model is prone to overfitting due to its high dimensionality [11]. We aim to effectively incorporate weather-based features into the model, as they can capture short-term abnormalities caused by $y_{t}$ -related features. In this sense, our approach initially shapes the long-term patterns with dominant feature vectors. Then, it fine-tunes them with the side information through cost optimization, resulting in using the side information and generating a multivariate solution. Note that this phenomenon happens in most real-life applications, where due to the dominance of certain features, the other features are hardly utilized.

Formally, at time step $t$, we have a matrix $\boldsymbol{X}$ with length $N$ and dimensionality $M$, i.e., the $N$ data points of $\boldsymbol{X}$ lie in an $M$-dimensional Euclidean space. Enclosing these points in an $M$-dimensional hypersphere of radius $r$, we approximate the distance between data points as $d\approx 2\cdot r$. If $M$ approaches $N$, the volume of the hypersphere, $V_{M}(r)$, given by:

V_{M}(r)=\frac{\pi^{\frac{M}{2}}}{\Gamma\left(\frac{M}{2}+1\right)}r^{M}

(1)

increases exponentially, leading to a sparser data distribution characterized by a larger $d$ [1].

We also suffer from data sparsity due to non-stationarity, which is generally caused by trends and seasonality, while conducting time series forecasting for the target sequence $\{y_{t}\}$. Such trends and seasonality reduce the relative number of unique data points, which can lead to overfitting.

To this end, we propose a hierarchical ensemble-based feature selection method for the time series forecasting task to overcome overfitting and non-stationarity. We split $\boldsymbol{x}_{t}$ into $K$ non-intersecting feature subsets $\boldsymbol{s}_{t}^{(k)},k=1,...,K$ based on domain knowledge or using certain heuristics such as feature importance metrics as demonstrated in our experiments. After this split, we train $K$ machine learning models, in a dependent manner, that take each $\boldsymbol{s}_{t}^{(k)}$ as input in a hierarchical order by optimizing a cost function after each model until the $K^{th}$ one. Before describing our approach, we discuss the well-known approaches to this problem for completeness in the following.

3.1 Common Approaches

There are many approaches for circumventing the curse of dimensionality such as wrapper-based, embedded, filtering, and ensemble methods. Let us denote a subset of the full feature vector $\boldsymbol{x}_{t}$ as $\boldsymbol{s}^{(k)}_{t}$ , $k=1,\ldots,2^{M}$ , i.e., given $M$ features, there exist $2^{M}$ different subsets of the feature set, which may include one or more features from $\boldsymbol{x}_{t}$ .

3.1.1 Wrapper-based Methods

As a greedy method, the well-established wrappers have an optimization objective over the validation loss that finds the best-performing feature subset $\boldsymbol{s}^{*}_{t}$ as follows:

\boldsymbol{s}^{*}_{t}=\operatorname*{arg\,min}_{\boldsymbol{s}^{(k)}_{t}\subset\boldsymbol{x}_{t}}L_{val}=\operatorname*{arg\,min}_{\boldsymbol{s}^{(k)}_{t}\subset\boldsymbol{x}_{t}}\sum_{t^{\prime}=t_{1}}^{t_{2}}\ell\big(y_{t^{\prime}},f_{\boldsymbol{w}}(y_{t^{\prime}},\boldsymbol{s}^{(k)}_{t^{\prime}})\big),

(2)

subject to $k=1,\ldots,2^{M}$, where $\{y_{t}\}_{\{t_{1}:t_{2}\}}$ and $\{\boldsymbol{s}_{t}^{(k)}\}_{\{t_{1}:t_{2}\}}$ are the target sequence and the $k^{\text{th}}$ feature subset over the validation set with $t_{1},t_{2}\in\mathbb{Z}$, $1<t_{1}<t_{2}<t$, $f_{\boldsymbol{w}}$ is a machine learning model trained on the $k^{\text{th}}$ feature subset $\boldsymbol{s}^{(k)}_{t}$ with parameters $\boldsymbol{w}$, and $\ell$ is the loss function of the model. The algorithm iterates through each feature subset $\boldsymbol{s}^{(k)}_{t}$, seeking to maximize the performance of the machine learning model on the validation set. Due to the resulting computational complexity, wrappers are rarely used in their complete form in real-life applications.
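The exhaustive search in (2) can be sketched in a few lines. The snippet below is a minimal illustration rather than the implementation used in this paper: it assumes a least-squares linear model as a stand-in for $f_{\boldsymbol{w}}$, the squared error as $\ell$, and a chronological train/validation split; the names `fit_predict_linear` and `exhaustive_wrapper` are hypothetical.

```python
from itertools import combinations

import numpy as np


def fit_predict_linear(X_train, y_train, X_val):
    """Least-squares linear model with intercept (assumed stand-in for f_w)."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_val = np.column_stack([X_val, np.ones(len(X_val))])
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_val @ w


def exhaustive_wrapper(X_train, y_train, X_val, y_val):
    """Return the non-empty feature subset with the lowest validation squared loss."""
    M = X_train.shape[1]
    best_subset, best_loss = None, np.inf
    for size in range(1, M + 1):
        for subset in combinations(range(M), size):
            cols = list(subset)
            y_hat = fit_predict_linear(X_train[:, cols], y_train, X_val[:, cols])
            loss = float(np.sum((y_val - y_hat) ** 2))
            if loss < best_loss:
                best_subset, best_loss = subset, loss
    return best_subset, best_loss


# Synthetic data: only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=200)

# Chronological split: first 150 samples for training, last 50 for validation.
wrapper_subset, wrapper_loss = exhaustive_wrapper(X[:150], y[:150], X[150:], y[150:])
n_subsets_evaluated = 2 ** X.shape[1] - 1  # every non-empty subset is fitted once
```

Even in this toy setting with five features, 31 models are fitted; the count doubles with each added feature, which is why the complete search is impractical for large $M$.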

3.1.2 Embedded Methods

As another approach in the literature, embedded methods perform feature selection as a part of the model construction process. Examples of these methods include Random Forests and Gradient Boosting Trees. Tree-based models eliminate features once or recursively (also known as Recursive Feature Elimination) based on their feature importance rankings and use the remaining ones for training. One drawback of this method is that it is univariate, considering one feature at a time while calculating the scores.

Figure 1: We have $K$ feature subsets used as inputs to $K$ base models (blue). Then, we combine base learners with $\boldsymbol{\alpha}_{t}^{(i)}$ for final prediction (pink).

3.1.3 Filtering Methods

Another traditional method of feature selection is filtering. Unlike other methods, filtering relies on statistical measures instead of using machine learning algorithms. For instance, the score of the Pearson Correlation Coefficient [21] between $x_{t^{\prime}}^{(k)}$ and $\{y_{t^{\prime}}\}$ for $t^{\prime}\leq t$ is $S(x^{(k)})=\mathrm{cov}(x^{(k)},y)/(\sigma(x^{(k)})\cdot\sigma(y))$.

A filtering algorithm can incrementally form an optimal set $\boldsymbol{s}_{t}^{*}$, in terms of correlation, with $m$ feature vectors from $\boldsymbol{x}_{t}$ based on maximum dependency. This is accomplished by discarding, in each iteration, the feature vector of $\boldsymbol{x}_{t}$ with the lowest correlation score. On the other hand, filtering methods do not inherently incorporate domain knowledge and cannot measure nonlinear dependency since they solely rely on statistical measures. Moreover, they are univariate, calculating the score of each feature one by one.
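As a small illustration of the filtering procedure described above, the following sketch computes the Pearson score $S(x^{(k)})$ for every feature and discards the lowest-scoring feature until $m$ remain. Ranking by the absolute value of the score is an assumption made here so that strongly negative correlations are also kept; the function names are hypothetical.

```python
import numpy as np


def pearson_scores(X, y):
    """Pearson score S(x^(k)) = cov(x^(k), y) / (sigma(x^(k)) * sigma(y)) per column."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / len(y)
    return cov / (X.std(axis=0) * y.std())


def filter_select(X, y, m):
    """Discard the feature with the lowest |score| in each iteration until m remain."""
    kept = list(range(X.shape[1]))
    scores = np.abs(pearson_scores(X, y))  # assumption: rank by absolute correlation
    while len(kept) > m:
        worst = min(kept, key=lambda k: scores[k])
        kept.remove(worst)
    return kept


# Synthetic data: features 1 and 2 are linearly related to the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 1] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=300)

scores = pearson_scores(X, y)
selected = filter_select(X, y, m=2)
```

Because each score depends on a single column, the scores never change between iterations, which makes the univariate nature of the method explicit.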

3.1.4 Ensemble-based Methods

In this method, the predictions of machine learning models, also called base learners, are directly used to determine the weight vector that ensembles the base learners, as shown in Figure 1. The version with two base learners is demonstrated in Algorithm 1. All base learners take different feature subset vectors as input. Combining the predictions of $K$ base learners, denoted as ${{\tilde{y}}^{(i)}_{t}},i=1,\ldots,K$ , the ensemble prediction is found as follows:

{\tilde{y}}_{t}^{E}=\boldsymbol{\alpha}_{t}^{T}\boldsymbol{\tilde{y}}_{t},

(3)

where $\boldsymbol{\tilde{y}}_{t}=[{\tilde{y}}^{(1)}_{t},...,{\tilde{y}}^{(K)}_{t}]^{T}$ is the base prediction vector of $K$ machine learning models, and $\boldsymbol{\alpha}_{t}$ is the ensembling coefficient vector $\boldsymbol{\alpha}_{t}\in\mathbb{R}^{K}$ . With affine-constraint optimization, the loss of ensemble models is subject to $\boldsymbol{\alpha}_{\boldsymbol{s}_{t}^{(i)}}^{T}\boldsymbol{1}=1$ :

\operatorname*{min}_{{\alpha}_{\boldsymbol{s}_{t}^{(i)}}\in\mathbb{R}^{K}}\ell\big({y}_{t},\sum_{i=1}^{K}{\alpha}_{\boldsymbol{s}_{t}^{(i)}}^{(i)}\,{\tilde{y}}_{t}^{(i)}\big).

(4)

The optimization space is $(K-1)$-dimensional since the $K^{th}$ component of $\boldsymbol{\alpha}_{t}$ is fixed by the constraint that the entries of the vector sum to 1, i.e., ${\alpha}_{\boldsymbol{s}_{t}^{(K)}}^{(K)}=1-\sum_{i=1}^{K-1}{\alpha}_{\boldsymbol{s}_{t}^{(i)}}^{(i)}$. Even so, conventional ensemble methods can be computationally expensive, and more significantly, since the base learners are independently trained, they cannot exploit the co-dependency between feature subsets.

Algorithm 1 Ensemble Model

${\tilde{y}}^{(1)}_{t}=g(\boldsymbol{s}_{t}^{(1)},{y}_{t})$ $\triangleright$ Train base learner $g$ with $\boldsymbol{s}^{(1)}_{t}$
${\tilde{y}}^{(2)}_{t}=h(\boldsymbol{s}^{(2)}_{t},{y}_{t})$ $\triangleright$ Train base learner $h$ with $\boldsymbol{s}^{(2)}_{t}$
for $t=1$ to $N$ do $\triangleright$ Iterate through each time step $t$
    $\text{min\_loss}\leftarrow\infty$
    $\tilde{\alpha}_{t}\leftarrow 0$
    for $\alpha=0$ to $1$ do
        ${\tilde{y}}_{t}=\alpha\cdot{\tilde{y}}^{(1)}_{t}+(1-\alpha)\cdot{\tilde{y}}^{(2)}_{t}$
        $\text{loss}=\ell({y}_{t},\tilde{y}_{t})$
        if $\text{loss}<\text{min\_loss}$ then
            $\text{min\_loss}=\text{loss}$
            $\tilde{\alpha}_{t}\leftarrow\alpha$
        end if
    end for
end for
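The inner loop of Algorithm 1 can be vectorized over all time steps. The sketch below assumes a uniform grid of candidate weights on $[0,1]$ (the step size is not specified in the algorithm, so `n_grid` is a hypothetical parameter) and takes the base predictions as given arrays instead of training $g$ and $h$.

```python
import numpy as np


def ensemble_weights(y, y1, y2, n_grid=101, loss=lambda y, p: (y - p) ** 2):
    """Per-time-step grid search for the weight alpha_t of a two-learner ensemble."""
    alphas = np.linspace(0.0, 1.0, n_grid)  # candidate alpha values (assumed grid)
    # blended[j, t] = alpha_j * y1[t] + (1 - alpha_j) * y2[t]
    blended = alphas[:, None] * y1[None, :] + (1.0 - alphas[:, None]) * y2[None, :]
    losses = loss(y[None, :], blended)
    # argmin returns the first minimizer, matching the strict "<" in Algorithm 1
    return alphas[np.argmin(losses, axis=0)]


y = np.array([1.0, 2.0, 3.0])    # targets
y1 = np.array([1.0, 0.0, 2.0])   # predictions of base learner g
y2 = np.array([3.0, 2.0, 4.0])   # predictions of base learner h
alpha = ensemble_weights(y, y1, y2)
y_ensemble = alpha * y1 + (1.0 - alpha) * y2
```

In this toy case the first learner is exact at the first step, the second at the second step, and their average is exact at the third, so the selected weights are close to 1, 0, and 0.5, respectively.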

4 Hierarchical Ensemble-Based Approach

To overcome the limitations of traditional feature selection methods, we propose a hierarchical ensemble-based approach involving $K$ distinct machine learning models organized into $K$ levels. Figure 2 illustrates two sample successive layers of the structure. Each machine learning model takes the output of the previous layer ( $f^{(i-1)}$ in Figure 2). Then, the latter model ( $f^{(i)}$ in Figure 2) generates the predictions of the optimized weights that scale the last output. Each model operates on a different subset of $\boldsymbol{x}_{t}$ , which may include exogenous information, also called the side-information, and features derived from the past information of the target sequence $\{y_{t}\}$ , $\{y_{t-j}\},j=1,\ldots,N-1$ , denoted as $\boldsymbol{s}_{t}({y})$ .

While the hierarchical ordering of the models is guided by domain expertise, we propose that the first level should exclusively comprise $\{{y}_{t-j}\}$ (and features derived using them, e.g., their rolling means) referred to as ${y}_{t}$ -related features as shown in different time series prediction papers [32, 18]. The reason is that these past target values exhibit higher importance scores than the side information sequences, thereby dominantly influencing predictions. In fact, many machine learning-based time-series models suffer from “overfitting” to the ${y}_{t}$ -related features and ignore most of the features [10]. Subsequent levels can incorporate the side information sequences $\boldsymbol{s}_{t}^{(k)}\subset\{\boldsymbol{x}_{t}\}\setminus\boldsymbol{s}_{% t}({y})$ for $k=1,\ldots,K-1$ . We next describe the layers in the hierarchy and the optimization procedure thereof.

4.1 Description of Layers

To explicate the layers composing the introduced architecture, in Figure 2, we present a snapshot of the generic model with all the major components in action; these represent the core operations in the overall hierarchy which are repeated in succession. In Figure 2, we observe four main components in transitioning from layer $i-1$ to $i$: two machine learning models (left and middle right), a cost optimization function (middle left), and a linear superposition function (right). The leftmost machine learning model, $f^{(i-1)}$, is fed with the $(i-1)^{th}$ restricted side information sequence $\boldsymbol{s}_{t}^{(i-1)}$, which corresponds to the “feature bagging” technique for addressing the bias-variance trade-off [6], as well as with the refined predictions of the previous layer, which we will elaborate on in the following. There is no restriction on what this learning model could be, nor on the loss function it aims to minimize. After the learning is finished, the predictions of the model are acquired and passed onto the cost optimization block (middle left part in Figure 2). Therein lies the novelty of our algorithm: unlike the usual boosting procedure, e.g., the one LightGBM uses, we do not transmit these predictions as is to the next model in the chain but instead subject them to a weighting. To minimize the final loss, the cost optimization function $g$ generates a weight sequence $\alpha_{t}^{(i)}$ that scales the previous prediction sequence, $\tilde{y}_{t}^{(i-1)}$. Another novelty of this method lies in its compatibility with any loss function in the cost optimization stage, which overcomes the limitations of loss functions with impractical second derivatives, e.g., the L1 loss. The details of the optimization of the $\alpha_{t}^{(i)}$ values are given in Section 4.2.

For context-awareness, we further feed these optimized $\alpha_{t}^{(i)}$ s into another learning model, $f^{(i)}$ , which uses the current context $\boldsymbol{s}_{t}^{(i)}$ for its training; we acquire the context-aware weights $\tilde{\alpha}_{t}^{(i)}$ out of it. Then in the last superposition stage, a linear function $h$ scales the prediction of the leftmost model $f^{(i-1)}$ with $\tilde{\alpha}_{t}^{(i)}$ s to obtain the refined predictions, which are then fed to the next block in the series. The chain of blocks continues this way until $i$ hits the user-defined hierarchy size parameter $K$ . This flow of the model is depicted in Algorithm 2.

Formally, in the transition from the $(i-1)^{th}$ to the $i^{th}$ layer of the algorithm, we first generate the $(i-1)^{th}$ model’s predictions for each time $t$, denoted as ${\tilde{y}}^{(i-1)}_{t}$, where $i=2,\ldots,K$, inputting $\boldsymbol{s}_{t}^{(i-1)}$ into base learner $f^{(i-1)}$. Subsequently, we deduce the weight sequence ${\alpha}^{(i)}_{t}$ to refine ${\tilde{y}}^{(i-1)}_{t}$ by optimizing iteratively in a loop as depicted in Algorithm 2 for:

{\hat{y}}^{(i)}_{t}={\alpha}_{t}^{(i)}{\tilde{y}}_{t}^{(i)},

(5)

where $\alpha_{t}^{(i)}\in[0,2]$ for each time $t$ . As depicted in the middle left pink block in Figure 2, we determine ${\alpha}^{(i)}_{t}$ with an optimization loop to minimize the loss at each time $t$ in layer $i$ as,

L_{t}=\ell({y}_{t},{\hat{y}}_{t}^{(i)})=\ell({y}_{t},{\alpha}_{t}^{(i)}{\tilde{y}}_{t}^{(i)}).

(6)

In general, $\ell$ need not be differentiable. Unlike ensemble methods, we optimize $\ell$ with ${\tilde{y}}^{(i)}_{t}$ and ${\tilde{y}}^{(i-1)}_{t}$ by scaling ${\tilde{y}}^{(i-1)}_{t}$ at each time $t$ with ${\alpha}^{(i)}_{t}$ . The cost optimization step is further elaborated in Section 4.2. In this sense, in the $i^{th}$ layer, we update the output of the $(i-1)^{th}$ layer using the features that belong to the $i^{th}$ layer. Therefore, every subset of features contributes to minimizing the final error similar to boosting or stacking.

As shown in the middle right block in Figure 2, we train the subsequent model ( $f^{(i)}$ in Figure 2) with ${\alpha}^{(i)}_{t}$ and $\boldsymbol{s}_{t}^{(i)}$ to predict the ultimate weights ${{\tilde{\alpha}}}^{(i)}_{t}$, which are context-aware. In the green block at the end, we modify ${\tilde{y}}^{(i)}_{t}$ using ${\tilde{\alpha}}^{(i)}_{t}$ to incorporate the side information, leveraging the learned patterns from the error of the previous layer with a linear superposition function $h$, as represented by

{\tilde{y}}^{(i+1)}_{t}=h(\tilde{y}_{t}^{(i)},\tilde{\alpha}_{t}^{(i)})={\tilde{\alpha}}^{(i)}_{t}{\tilde{y}}^{(i)}_{t}.

(7)

In the following section, the cost optimization step between layers 1 and 2 in Algorithm 2 is elaborated.

Algorithm 2 Hierarchical Ensemble-Based Method

$\boldsymbol{s}^{(i-1)}_{t}\leftarrow\boldsymbol{s}_{t}({y})\in\boldsymbol{X},\text{ where }t=1,2,\dots,N,\ i=2,3,\dots,K+1$
${\tilde{y}}^{(i)}_{t}\leftarrow f^{(i-1)}(\boldsymbol{s}^{(i-1)}_{t},\tilde{y}^{(i-1)})$
procedure Cost Optimization
    for $t=1,2,\dots,N$ do
        $\text{min\_loss}\leftarrow\infty$
        $\text{range}=[1-\beta,1+\beta],\text{ where }\beta\in[0,1]\subset\mathbb{R}$
        for $\alpha\leftarrow 1-\beta$ to $1+\beta$ do
            $\hat{y}^{(i)}_{t}=\alpha\tilde{y}^{(i)}_{t}$
            $\text{loss}=\ell(y_{t},\hat{y}^{(i)}_{t})$
            if $\text{loss}<\text{min\_loss}$ then
                $\text{min\_loss}=\text{loss}$
                $\alpha_{t}^{(i)}=\alpha$
            end if
        end for
    end for
end procedure
$\boldsymbol{s}^{(i)}_{t}\leftarrow\boldsymbol{s}^{(i)}_{t}\subset\{\boldsymbol{x}_{t}\}\setminus\boldsymbol{s}_{t}({y})$
${\tilde{\alpha}}_{t}^{(i)}\leftarrow f^{(i)}(\boldsymbol{s}^{(i)}_{t},{\alpha}_{t}^{(i)})$
${\tilde{y}}^{(i+1)}_{t}\leftarrow{\tilde{\alpha}}_{t}^{(i)}{\tilde{y}}^{(i)}$
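One transition of Algorithm 2 can be sketched as follows. This is an illustrative sketch under several assumptions and not the released implementation: both learners are plain least-squares linear models (the method allows any learner), the candidate weights form a uniform grid on $[1-\beta,1+\beta]$, the L1 loss is used in the cost optimization, and everything is evaluated in-sample on synthetic data. All function and variable names are hypothetical.

```python
import numpy as np


def fit_linear(X, target):
    """Least-squares fit with intercept; returns a predict function (assumed learner)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return lambda Z: np.column_stack([Z, np.ones(len(Z))]) @ w


def cost_optimization(y, y_tilde, beta=1.0, n_grid=201,
                      loss=lambda y, p: np.abs(y - p)):
    """Grid search for alpha_t in [1 - beta, 1 + beta] at every time step."""
    alphas = np.linspace(1.0 - beta, 1.0 + beta, n_grid)
    losses = loss(y[None, :], alphas[:, None] * y_tilde[None, :])
    return alphas[np.argmin(losses, axis=0)]


def hierarchical_layer(S_y, S_side, y, beta=1.0):
    """One layer transition: base prediction, weights, context-aware weights, rescaling."""
    f_prev = fit_linear(S_y, y)              # model on the y-related features
    y_tilde = f_prev(S_y)                    # base predictions
    alpha = cost_optimization(y, y_tilde, beta)
    f_next = fit_linear(S_side, alpha)       # model on the side information, target alpha
    alpha_tilde = f_next(S_side)             # context-aware weights
    return y_tilde, alpha, alpha_tilde * y_tilde


# Synthetic data: a dominant y-related feature, modulated by one side feature.
rng = np.random.default_rng(0)
lag = rng.uniform(5.0, 10.0, size=300)
side = rng.uniform(-1.0, 1.0, size=300)
y = lag * (1.0 + 0.3 * side)

y_tilde, alpha, y_refined = hierarchical_layer(lag[:, None], side[:, None], y)
base_mae = float(np.mean(np.abs(y - y_tilde)))
refined_mae = float(np.mean(np.abs(y - y_refined)))
```

In this toy setup the first model can only capture the dominant feature, while the second model learns how the side feature should rescale that prediction, so the refined error is noticeably lower than the base error.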

4.2 Cost Optimization

The cost optimization step is an iterative search in which we choose $\alpha_{t}^{(i)}$ from a predetermined range to rescale $\tilde{y}_{t}^{(i)}$ in any layer $i$ . We aim to find $\alpha_{t}^{(i)}\in\mathbb{R}$ for each time $t$ , which gives the following optimization objective:

\operatorname*{min}_{{\alpha}_{t}^{(i)}\in\mathbb{R}}\ell\big({y}_{t},{\alpha}_{t}^{(i)}\,{\tilde{y}}_{t}^{(i)}\big),

(8)

where $\ell$ can be any loss function, subject to ${\alpha}_{t}^{(i)}\in[0,2]$ for each time $t$ . A weight in the range $[0,1]$ effectively downscales the prediction, whereas a weight in the range $[1,2]$ upscales it; allowing both directions is intended to improve robustness. This procedure is designed to be flexible so that the algorithm can be adapted to any domain-specific loss function. The “Cost Optimization” procedure in Algorithm 2 and the leftmost pink block in Figure 2 show the structure of the cost function.
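The search in the “Cost Optimization” procedure can be sketched as a per-time-step grid search. The sketch below assumes an evenly spaced grid over $[1-\beta,1+\beta]$ ; the grid resolution `num_steps` and the function names are hypothetical, since the step size of the search is not specified in the text.

```python
def optimal_alphas(y_true, y_tilde, loss, beta=0.33, num_steps=21):
    """Per-time-step grid search over alpha in [1 - beta, 1 + beta]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    # Evenly spaced candidates; the grid resolution is an assumption.
    grid = [1.0 - beta + 2.0 * beta * k / (num_steps - 1) for k in range(num_steps)]
    alphas = []
    for y, y_hat in zip(y_true, y_tilde):
        best_alpha, min_loss = 1.0, float("inf")
        for alpha in grid:
            current = loss(y, alpha * y_hat)
            if current < min_loss:  # strict "<" keeps the first minimizer
                min_loss, best_alpha = current, alpha
        alphas.append(best_alpha)
    return alphas


def l1_loss(y, y_hat):
    """Absolute error; any callable loss, differentiable or not, can be used."""
    return abs(y - y_hat)


y_true = [1.2, 0.8, 1.0]
y_tilde = [1.0, 1.0, 1.0]
alphas = optimal_alphas(y_true, y_tilde, l1_loss, beta=0.5, num_steps=11)
```

Because the loss is only evaluated, never differentiated, the same routine works for any domain-specific loss passed as `loss`.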

To understand the contribution of our cost optimization approach, we compare it with the loss structure of a powerful tree-based model, LightGBM [15]. During training, LightGBM iteratively updates the model by minimizing the chosen cost function, typically with gradient-boosting techniques. With a decision tree as the base model of LightGBM, the split at each leaf is found using the gradient and the Hessian of the loss function. The gradient is defined as

{g}_{t}(x)=\mathit{E}_{y}\left[\frac{\partial\ell(y_{t},f(x_{t},\theta,y_{t}))}{\partial f(x_{t},\theta,y_{t})}\,|\,x\right]_{f(x_{t},\theta,y_{t})=\hat{f}_{t-1}(x_{t},\theta,y_{t})},

(9)

and the Hessian is defined as

{h}_{t}(x)=\mathit{E}_{y}\left[\frac{\partial^{2}\ell(y_{t},f(x_{t},\theta,y_{t}))}{\partial f(x_{t},\theta,y_{t})^{2}}\,|\,x\right]_{f(x_{t},\theta,y_{t})=\hat{f}_{t-1}(x_{t},\theta,y_{t})}.

(10)

If we used a custom loss function in LightGBM, it would calculate the gradient and Hessian of $\ell$ with respect to $\tilde{y}_{t}^{(i)}$ to determine the direction and magnitude of the updates. As an example, for the L1 loss $\ell(y_{t},\alpha_{t}\tilde{y}_{t}^{(i)})=|y_{t}-\alpha_{t}\tilde{y}_{t}^{(i)}|$ , the gradient is given by

g_{t}^{(i)}=\frac{\partial\ell(y_{t},\alpha_{t}\tilde{y}_{t}^{(i)})}{\partial\tilde{y}_{t}^{(i)}}=-\alpha_{t}\operatorname{sign}\big(y_{t}-\alpha_{t}\tilde{y}_{t}^{(i)}\big),

(11)

which is constant in $\tilde{y}_{t}^{(i)}$ away from the point $y_{t}=\alpha_{t}\tilde{y}_{t}^{(i)}$ , so the Hessian is 0 wherever it is defined, a form that LightGBM cannot natively process. Therefore, a custom loss function needs a nonzero Hessian to be used in this way.
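To make the zero-Hessian issue concrete, the following sketch contrasts the derivatives of the L1 and L2 losses with respect to $\tilde{y}_{t}^{(i)}$ . It is a simplified illustration: the function names are hypothetical, and `newton_step` stands for a generic second-order update of the form $-g/h$ rather than the exact leaf-value computation of any particular library.

```python
def l1_grad_hess(y, y_tilde, alpha):
    """Gradient and Hessian of |y - alpha * y_tilde| with respect to y_tilde."""
    residual = y - alpha * y_tilde
    sign = (residual > 0) - (residual < 0)
    grad = -alpha * sign  # magnitude alpha, constant away from residual = 0
    hess = 0.0            # zero wherever the derivative exists
    return grad, hess


def l2_grad_hess(y, y_tilde, alpha):
    """Gradient and Hessian of (y - alpha * y_tilde)^2 with respect to y_tilde."""
    residual = y - alpha * y_tilde
    grad = -2.0 * alpha * residual
    hess = 2.0 * alpha ** 2  # strictly positive for alpha != 0
    return grad, hess


def newton_step(grad, hess):
    """Generic second-order update -grad / hess; undefined for a zero Hessian."""
    if hess == 0.0:
        return None
    return -grad / hess


l1_step = newton_step(*l1_grad_hess(1.0, 2.0, 0.8))  # no usable step
l2_step = newton_step(*l2_grad_hess(1.0, 2.0, 0.8))  # well-defined step
```

The L2 loss yields a usable curvature term, while the L1 loss does not, which is the motivation for searching over the weights directly instead of relying on derivatives.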

In our application, we take a different approach that makes it convenient to embed any custom loss into our base learners, e.g., LightGBM. We define a tuned hyperparameter $\beta\in[0,1]$ so that $1+\beta$ is the upper limit and $1-\beta$ is the lower limit of the chosen $\alpha_{t}^{(i)}$ value. As depicted in Algorithm 2, we find the optimal $\alpha_{t}^{(i)}$ in a greedy process that directly minimizes any custom loss function at each time step $t$ . Therefore, the objective in (6) is employed. After this step, the resulting $\alpha_{t}^{(i)}$ vector is passed to the following model, depicted as $f^{(i)}$ in Figure 2, which predicts $\tilde{\alpha}_{t}^{(i)}$ .

By iterating the process in Algorithm 2, we can effectively search for the best combination of features and capture the correlation between them. To improve on traditional feature selection methods, the iterative process lets us optimize the $K$ models together, since the predicted output of one model is used to enhance the next.

In the next section, we illustrate the performance of our hierarchical ensemble-based model on synthetic and widely known real-life datasets.

5 Simulations

In this section, we illustrate the performance of our hierarchical ensemble-based model with 2 layers, i.e., $K=2$ , in comparison with other models on various well-known time series datasets. Initially, we introduce the models used for comparison. Then, we provide the performance of our model.

5.1 Compared Models

The simulations include five models, labeled Wrapper, Ensemble, Hierarchical Ensemble, Embedded, and Baseline LightGBM. The first compared method, Wrapper, which is described in Section 3, discards in each iteration the feature that contributes the least to the model based on the L2 loss. The Embedded model described in Section 3 uses only $\{y_{t-j}\}$ and features derived from $\{y_{t-j}\}$ , e.g., rolling features, which we denote by $\boldsymbol{s}_{t}(y)$ . We feed only the $y_{t}$ -related features to a separate model in the simulations to investigate whether the most important features suffice without domain knowledge. The Ensemble model follows Algorithm 1 with 2 baseline models, mixing $\boldsymbol{s}_{t}(y)$ and $\{\boldsymbol{x}_{t}\}\setminus\{\boldsymbol{s}_{t}(y)\}$ with ${\alpha}_{t}^{(i)}$ , which is chosen in an iterative process minimizing the L1 objective. Lastly, the Baseline LightGBM model uses the whole $\boldsymbol{x}_{t}$ . With this model, we examine whether feature selection is necessary at all for these datasets. Moreover, we expect to observe overfitting due to the high dimensionality and non-stationarity.
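The Wrapper baseline can be pictured as greedy backward elimination. The sketch below is an assumption-laden illustration rather than the exact procedure of Section 3: `evaluate` is a hypothetical callable that would train a model on a feature subset and return its L2 loss, the stopping rule is assumed, and `toy_loss` is a stand-in so that the example runs without a learner.

```python
def backward_elimination(features, evaluate, min_features=1):
    """Greedy wrapper: repeatedly drop the feature whose removal hurts least."""
    selected = list(features)
    best_loss = evaluate(selected)
    while len(selected) > min_features:
        # Loss obtained after removing each remaining feature in turn.
        candidates = [
            (evaluate([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        loss, drop = min(candidates)
        if loss > best_loss:  # every removal makes the model worse (assumed stop rule)
            break
        best_loss = loss
        selected.remove(drop)
    return selected, best_loss


# Toy stand-in for "train a model and return its validation L2 loss".
INFORMATIVE = {"lag_1": 1.0, "rolling_mean": 0.5}


def toy_loss(subset):
    missing = sum(w for f, w in INFORMATIVE.items() if f not in subset)
    noise = 0.1 * sum(1 for f in subset if f not in INFORMATIVE)
    return missing + noise


selected, loss = backward_elimination(
    ["lag_1", "rolling_mean", "noise_a", "noise_b"], toy_loss
)
```

Each pass calls `evaluate` once per remaining feature, so the number of model fits grows roughly quadratically with the number of features, which is one reason wrapper methods tend to be slow.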

The evaluation metric used in the experiments is the mean squared error (MSE). All experiments are repeated 200 times to ensure the reliability of the results. The synthetic dataset is generated 200 times with random Gaussian noise. For the real-life dataset, we randomly sampled 200 out of the 414 series of the hourly M4 forecasting dataset, a widely publicized competition dataset [31]. We compute the squared error between $\{\tilde{y}_{t}\}$ and $\{y_{t}\}$ , normalized by $N$ , for the $j^{th}$ experiment as follows:

{l}^{(j)}_{t}=({y}^{(j)}_{t}-{\tilde{y}}^{(j)}_{t})^{2}/N.

(12)

Then, the average over the 200 trials is taken to eliminate the bias due to using a particular sequence:

{\bar{l}}_{t}=\frac{\sum_{j=1}^{200}{l}^{(j)}_{t}}{200}.

(13)

Finally, the cumulative sum over time is divided by the elapsed time $t$ , i.e., a running average is taken, to smooth the results as follows:

{\bar{l}}^{(ave)}=\frac{\sum_{t^{\prime}=1}^{t}{\bar{l}}_{t^{\prime}}}{t}.

(14)
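The three steps in (12) to (14) can be sketched as follows. The function names are hypothetical, and the toy example uses two short trials instead of 200 so that the values can be checked by hand.

```python
def normalized_squared_errors(y_true, y_pred):
    """Eq. (12): squared error at each time step divided by the series length N."""
    n = len(y_true)
    return [(y - p) ** 2 / n for y, p in zip(y_true, y_pred)]


def average_over_trials(per_trial_errors):
    """Eq. (13): average the per-step errors across trials."""
    num_trials = len(per_trial_errors)
    return [sum(step) / num_trials for step in zip(*per_trial_errors)]


def running_average(values):
    """Eq. (14): cumulative sum up to time t divided by t."""
    out, total = [], 0.0
    for t, v in enumerate(values, start=1):
        total += v
        out.append(total / t)
    return out


# Two toy trials of length N = 4.
trials_true = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
trials_pred = [[1.0, 2.0, 5.0, 4.0], [3.0, 2.0, 3.0, 4.0]]
errors = [normalized_squared_errors(t, p) for t, p in zip(trials_true, trials_pred)]
curve = running_average(average_over_trials(errors))
```

Here the trial-averaged errors are 0.5, 0, 0.5, and 0, and the running average damps the spikes that a single large error would otherwise produce in the plot.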

Regarding the experimental settings, the $\beta$ hyperparameter in the cost optimization step of Algorithm 2 is fixed to 0.33 for both experiments. Therefore, the range of $\alpha_{t}^{(i)}$ is $[1-\beta,1+\beta]=[0.67,1.33]$ .

In the following sections, we present the analysis and experiments on the synthetic and real-life datasets. Since we simulate with $K=2$ models, we write $\boldsymbol{s}_{t}^{(i-1)}$ as $\boldsymbol{s}_{t}^{(1)}$ and, likewise, $\alpha_{t}^{(i)}$ as $\alpha_{t}$ .

5.2 Synthetic Dataset

The data is generated with an autoregressive moving average (ARMA) process of order $(4,5)$ , i.e.,

\displaystyle y_{t}=\sum_{i=1}^{4}\Phi_{i}y_{t-i}+\sum_{j=1}^{5}\theta_{j}\epsilon_{t-j}+\epsilon_{t},

(15)

where the autoregressive part consists of the lagged values up to lag 4, and the moving average part consists of the lagged error terms up to lag 5. The $\Phi$ and $\theta$ coefficients control the strength of the lags [5]. In our setting, $\Phi=[0.4,0.3,0.2,0.1]$ and $\theta=[0.65,0.35,0.3,-0.15,-0.3]$ . The Augmented Dickey-Fuller test [9] yields a p-value of 0.2116, so the unit-root null hypothesis is not rejected at the 0.05 level, which indicates non-stationarity. The series is then transformed with a min-max scaler.
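A minimal sketch of generating such a series with the coefficients above is given below, using only the Python standard library. The noise standard deviation of $\epsilon_{t}$ , the zero initial conditions, the seed, and the function names are assumptions; the series length of 500 follows Table 1.

```python
import random

PHI = [0.4, 0.3, 0.2, 0.1]              # AR coefficients from the text
THETA = [0.65, 0.35, 0.3, -0.15, -0.3]  # MA coefficients from the text


def generate_arma(n, phi=PHI, theta=THETA, noise_std=1.0, seed=0):
    """Simulate y_t = sum_i phi_i y_{t-i} + sum_j theta_j eps_{t-j} + eps_t."""
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    y, eps = [], []
    for t in range(n):
        e = rng.gauss(0.0, noise_std)
        # Terms that would reach before the start of the series are skipped.
        ar = sum(phi[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        y.append(ar + ma + e)
        eps.append(e)
    return y


def min_max_scale(series):
    """Rescale a series to the interval [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


series = min_max_scale(generate_arma(500))
```

Note that the autoregressive coefficients sum to one, so the autoregressive polynomial has a root at one; this is consistent with the unit-root test not rejecting non-stationarity.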

We first generated the feature subset that represents domain knowledge, $\{\boldsymbol{x}_{t}\}\setminus\{\boldsymbol{s}_{t}(y)\}$ . For that, a binary classification feature set is designed with 26 fully informative features. The synthetically generated binary sequence $y_{t}^{b}$ has a class imbalance of $0.65$ . The $y_{t}$ values obtained from (15) whose indexes correspond to label 1 in $y_{t}^{b}$ are multiplied by $1.33$ to upscale them, while the others are multiplied by $0.66$ to downscale them. This guarantees dependence on the generated domain knowledge series, which can provide a substantial amount of pattern. As the final step of generation, we added Gaussian noise drawn from $\mathit{N}(0,0.5)$ . To simulate the curse of dimensionality problem, the total number of features is set to 36, as shown in Table 1, and most of the features are informative. The average p-value is also higher than $0.05$ , which confirms that our experimental setting exhibits non-stationarity.
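The label-dependent scaling and the final noise step can be sketched as below. Two readings are assumed and marked in the comments: the class imbalance of $0.65$ is taken as the probability of label 1, and the second argument of $\mathit{N}(0,0.5)$ is taken as the variance. The constant placeholder series and the function names are hypothetical.

```python
import random


def apply_label_scaling(y, labels, up=1.33, down=0.66):
    """Upscale samples with label 1 and downscale the rest."""
    return [v * (up if b == 1 else down) for v, b in zip(y, labels)]


def add_gaussian_noise(y, std, rng):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return [v + rng.gauss(0.0, std) for v in y]


rng = random.Random(0)
n = 500
base = [1.0] * n  # placeholder for the scaled ARMA series
# Assumption: 0.65 is the probability of label 1.
labels = [1 if rng.random() < 0.65 else 0 for _ in range(n)]
scaled = apply_label_scaling(base, labels)
# Assumption: 0.5 in N(0, 0.5) is the variance, so the std is sqrt(0.5).
noisy = add_gaussian_noise(scaled, 0.5 ** 0.5, rng)
```

With this construction, the target can only be recovered well if the model learns the binary label from the domain knowledge features, which is the dependence the experiment is designed to test.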

With its high dimensionality and non-stationarity, the generated synthetic dataset is well-suited to the problem statement in Section 3.

Based on the performance plot in Figure 3, our proposed method outperforms the other compared models while the overall loss trend is decreasing. We infer that the $y_{t}$ -related features, given to $\boldsymbol{s}_{t}^{(1)}$ , capture most of the patterns in the label while $\boldsymbol{s}_{t}^{(2)}$ fine-tunes the short-term patterns, as desired. Moreover, the descending trend indicates the robustness of our method. The Wrapper method is by far the worst-performing model based on the MSE scores since it could not converge to a global optimum and instead uses a suboptimal feature subset. From Table 2, the far higher time consumption of Wrapper compared to Hierarchical Ensemble (393.8 s versus 9.3552 s) also confirms the time efficiency of our model. Hence, Wrapper is the least efficient of the compared methods. Additionally, the Embedded and Ensemble models are very close in loss. The Embedded model appears to generate the long-term patterns successfully with $y_{t}$ -related features; therefore, Ensemble favors the $y_{t}$ -related features over the date-related features. Even though the optimal feature subset is found by Embedded, it is outperformed by Hierarchical Ensemble since our method fully incorporates the information provided by codependent feature pairs rather than deciding whether to choose $y_{t}$ -related or domain knowledge features. Moreover, Baseline LightGBM performs the closest to the proposed approach, which indicates how informative the features are. However, our proposed approach outperforms Baseline LightGBM since we mitigate overfitting by training multiple models with fewer features each, whereas Baseline LightGBM overfits easily due to the large number of features.

Table 1: Statistics of the Datasets

Dataset	Sample Size	Feature Size	Average p-value
Synthetic	500	36	0.2376
M4 Hourly Forecasting	1001	24	0.36856

5.3 M4 Competition Datasets

Here, the hourly M4 dataset, which includes 414 different time series [31], is used as the real-life dataset. The M4 competition dataset does not include date-time indexes. Hence, we assign the indexes externally to extract date-related features. The sample size of the train set was reduced to 953, while the test set includes 48 samples, as shown in Table 1. Applying the stationarity test to $y_{t}$ , we found the p-value to be $0.36856$ on average, which indicates non-stationarity, as shown in Table 1. We give the $y_{t}$ -related features to $\boldsymbol{s}_{t}^{(1)}$ to be consistent with the synthetic dataset. We transformed the desired data with a min-max scaler. In this setting, we took the $2^{nd}$ , $4^{th}$ , and $6^{th}$ lags of the desired data to generate the mean and standard deviation of the rolling window feature. The date-time features are included in $\boldsymbol{s}_{t}^{(2)}$ . The date-related features are the sine and cosine encodings of the hour, day of the month, day of the week, month of the year, quarter of the year, and week of the year.
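The date-related features can be illustrated with a short sketch of sine and cosine encodings. Since the M4 hourly series carry no timestamps, the start date below is hypothetical, and the periods chosen for each calendar field (for example 31 for the day of the month and 53 for the week of the year) as well as the function names are assumptions.

```python
import math
from datetime import datetime, timedelta


def cyclical(value, period):
    """Map a periodic quantity to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def date_features(ts):
    """Sine and cosine encodings of the calendar fields of a timestamp."""
    fields = {
        "hour": (ts.hour, 24),
        "day_of_month": (ts.day - 1, 31),
        "day_of_week": (ts.weekday(), 7),
        "month": (ts.month - 1, 12),
        "quarter": ((ts.month - 1) // 3, 4),
        "week_of_year": (ts.isocalendar()[1] - 1, 53),
    }
    features = {}
    for name, (value, period) in fields.items():
        features[name + "_sin"], features[name + "_cos"] = cyclical(value, period)
    return features


# Hypothetical start date; 1001 hourly steps match the sample size in Table 1.
start = datetime(2020, 1, 1, 0)
index = [start + timedelta(hours=h) for h in range(1001)]
rows = [date_features(ts) for ts in index]
```

Encoding each field as a point on the unit circle keeps adjacent values close to each other, e.g., hour 23 and hour 0, which a raw integer encoding would not.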

Based on the performance of the proposed method on the M4 hourly dataset in Figure 4, the Hierarchical Ensemble outperforms the other models. We highlight that the $y_{t}$ -related features given at a higher level of the hierarchy form the long-term patterns, while the feature subset containing domain knowledge upscales or downscales these long-term patterns. Moreover, our method shows a more robust performance, with a stationary cumulative error, compared to the other methods, whose errors oscillate. Baseline LightGBM performs worse than our method because the high dimensionality causes overfitting. Although the feature size is smaller than in the synthetic experiment, Baseline LightGBM memorizes the unique patterns in the train set due to the highly informative features.

The Embedded and Ensemble methods perform close to each other since the Ensemble model generally selects the Embedded model by assigning an $\alpha_{t}$ of 1 from the range $[0,1]$ . It is important to note that the Embedded model utilizes a more dominant feature set than the date-related features, which indicates the inclination of Baseline LightGBM toward $y_{t}$ -related features. Our proposed method overcomes this issue by fine-tuning with the less dominant features in the last level for a greater impact on $\tilde{y}_{t}^{(2)}$ . Moreover, the Wrapper method performs the worst, which suggests that it converges to a non-optimal local minimum. It also shows the least robust MSE performance due to oscillatory behavior, in contrast to our method. From Table 2, the much higher time consumption of Wrapper compared to Hierarchical Ensemble (75.24 s versus 17.46 s) also confirms the efficiency of our model. As our proposed method performs the best, we conclude that the date-related features successfully modify the first-layer predictions, correcting overshooting or undershooting. Overall, the proposed method performs well on the real-life dataset, outperforming the other models.

Table 2: The Comparison of Average Time Consumption in Seconds

Dataset	Wrapper	Hierarchical Ensemble
Synthetic	393.8	9.3552
M4 Hourly Forecasting	75.24	17.46

6 Conclusions

In this work, we proposed an ensemble feature selection method based on hierarchical stacking. Building on traditional stacking methods, our approach leverages a hierarchical structure that fully exploits the co-dependency between features. This hierarchical stacking trains an initial machine learning model on a subset of the features and then updates the output of that model with another machine learning algorithm, which takes the remaining features or a subset of them to adjust the first-layer predictions while minimizing a custom loss. The novelty of this hierarchical structure lies in its flexible depth and its suitability to any loss function. We demonstrate the effectiveness of our approach on the synthetic and M4 competition datasets. Overall, the proposed hierarchical ensemble approach for feature selection offers a robust and scalable solution to the challenges posed by datasets whose feature sets are high dimensional relative to the sample size. By effectively capturing feature co-dependency, our method achieves better accuracy and stability than the traditional and state-of-the-art models we compared against. We also provide the source code of our approach to facilitate further research and replicability of our results.

Acknowledgements This work is in part supported by the Turkish Academy of Sciences Outstanding Researcher Program.

Author Contributions AT: Conceptualization, Methodology, Software, Investigation, Resources, Validation, Visualization, Writing- original draft, review. MEA: Conceptualization, Investigation, Resources, Validation, Writing- original draft, review. ATK: Conceptualization, Investigation, Supervision, Validation, Writing, review, editing. SSK: Conceptualization, Investigation, Methodology, Supervision, Validation, Writing, review, editing.

Funding Not applicable.

Data availability The synthetically generated data that support the findings of this study is available at https://github.com/aysintumay/hierarchical-feature-selection. The extraction process of the synthetic dataset is explained in the paper in detail. The real-life data that support the findings of this study is openly available at https://www.kaggle.com/datasets/yogesh94/m4-forecasting-competition-dataset.

Code Availability The code of this work is publicly available on https://github.com/aysintumay/hierarchical-feature-selection.

Declarations

Competing Interests: The authors declare that they have no competing financial or non-financial interests that could have influenced the presented work in this paper.
Ethics Approval: The authors approve that there are no ethical concerns related to the work presented in this paper.
Consent to Participate: All authors agreed on the content and explicitly consented to the submission of the paper.
Consent for Publication: All authors who participated in this study give the publisher permission to publish this work.

References

[1] Bellman, R.E. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.

[2] Bolón-Canedo, V., & Alonso-Betanzos, A. (2019a). Ensembles for feature selection: A review and future trends. Information Fusion, 52, 1–12. https://doi.org/10.1016/j.inffus.2018.11.008 https://www.sciencedirect.com/science/article/pii/S1566253518303440

[3] Bolón-Canedo, V., & Alonso-Betanzos, A. (2019b). Ensembles for Feature Selection: A Review and Future Trends. Information Fusion, 52, 1–12. https://doi.org/10.1016/j.inffus.2018.11.008

[4] Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2014). Data Classification Using an Ensemble of Filters. Neurocomputing, 135, 13–20. https://doi.org/10.1016/j.neucom.2013.03.067

[5] Box, G.E.P., & Jenkins, G.M. (1970). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.

[6] Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324

[7] Cortes, C., & Vapnik, V. (1995). Support Vector Networks. Machine Learning, 20, 273–297.

[8] Das, S. (2001). Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection. In Proceedings of the International Conference on Machine Learning. USA.

[9] Dickey, D.A., & Fuller, W.A. (1979). Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association, 74(366a), 427–431.

[10] Du, M. (2019). Improving LSTM Neural Networks for Better Short-Term Wind Power Predictions. In 2019 IEEE 2nd International Conference on Renewable Energy and Power Engineering (REPE) (pp. 105–109).

[11] Friedman, J.H. (1997). On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Data Mining and Knowledge Discovery, 1, 55–77. https://doi.org/10.1023/A:1009778005914

[12] Fumagalli, F., Muschalik, M., Hüllermeier, E., & Hammer, B. (2023). Incremental Permutation Feature Importance (iPFI): Towards Online Explanations on Data Streams. Machine Learning. https://doi.org/10.1007/s10994-023-06385-y

[13] Hancer, E. (2021). An improved evolutionary wrapper-filter feature selection approach with a new initialisation scheme. Machine Learning. https://doi.org/10.1007/s10994-021-05990-z https://www.mendeley.com/catalogue/53f9ff12-9a2d-3032-94d7-188d3887570d/

[14] Jenul, A., Schrunner, S., Pilz, J., & Tomic, O. (2022). A User-Guided Bayesian Framework for Ensemble Feature Selection in Life Science Applications (UBayFS). Machine Learning. https://doi.org/10.1007/s10994-022-06221-9

[15] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... Liu, T.-Y. (2017). LightGBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 3149–3157). Red Hook, NY, USA: Curran Associates Inc.

[16] Kohavi, R., & John, G.H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324. https://doi.org/10.1016/s0004-3702(97)00043-x

[17] Lee, J., Wang, W., Harrou, F., & Sun, Y. (2020). Wind Power Prediction Using Ensemble Learning-Based Models. IEEE Access, 8, 61517–61527. https://doi.org/10.1109/ACCESS.2020.2983234

[18] Lim, B., Arik, S.O., Loeff, N., & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4), 1748–1764. https://doi.org/10.1016/j.ijforecast.2021.03.012 https://www.sciencedirect.com/science/article/pii/S0169207021000637

[19] Lin, X., Li, C., Ren, W., Luo, X., & Qi, Y. (2019). A new feature selection method based on symmetrical uncertainty and interaction gain. Computational Biology and Chemistry, 83, 107149. https://doi.org/10.1016/j.compbiolchem.2019.107149 https://www.sciencedirect.com/science/article/pii/S1476927118303736

[20] Natekin, A., & Knoll, A. (2013, December). Gradient Boosting Machines, a tutorial. Frontiers in Neurorobotics, 7. https://doi.org/10.3389/fnbot.2013.00021

[21] Pearson, K. (1896). Mathematical Contributions to the Theory of Evolution. On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs. Proceedings of the Royal Society of London, 60, 489–498.

[22] Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572. https://doi.org/10.1080/14786440109462720

[23] Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 81–106. https://doi.org/10.1007/BF00116251

[24] Saeys, Y., Abeel, T., & Van de Peer, Y. (2008). Robust Feature Selection Using Ensemble Feature Selection Techniques. In Machine Learning and Knowledge Discovery in Databases (pp. 313–325).

[25] Saeys, Y., Inza, I., & Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507–2517. https://doi.org/10.1093/bioinformatics/btm344

[26] Sanz, H., Valim, C., Vegas, E., Oller, J., & Reverter, F. (2018, November). SVM-RFE: Selection and visualization of the most relevant features through non-linear kernels. BMC Bioinformatics, 19. https://doi.org/10.1186/s12859-018-2451-4

[27] Seijo-Pardo, B., Bolón-Canedo, V., & Alonso-Betanzos, A. (2019). On Developing an Automatic Threshold Applied to Feature Selection Ensembles. Information Fusion, 45, 227–245. https://doi.org/10.1016/j.inffus.2018.02.007

[28] Tibshirani, R. (1996, January). Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

[29] Urbanowicz, R.J., Meeker, M., La Cava, W., Olson, R.S., & Moore, J.H. (2018). Relief-based feature selection: Introduction and review. Journal of Biomedical Informatics, 85, 189–203. https://doi.org/10.1016/j.jbi.2018.07.014 https://www.sciencedirect.com/science/article/pii/S1532046418301400

[30] Verleysen, M., & François, D. (2005). The Curse of Dimensionality in Data Mining and Time Series Prediction. In J. Cabestany, A. Prieto, & F. Sandoval (Eds.), Computational Intelligence and Bioinspired Systems (pp. 758–770). Berlin, Heidelberg: Springer Berlin Heidelberg.

[31] Yogesh, S. (2020). M4 Forecasting Competition Dataset. Kaggle. https://www.kaggle.com/datasets/yogesh94/m4-forecasting-competition-dataset (Accessed on Apr. 1, 2023)

[32] Yu, Q.R., Wang, R., Arik, S., & Dong, Y. (2023). Koopman Neural Forecaster for Time-series with Temporal Distribution Shifts. In Proceedings of ICLR.

[33] Škrlj, B., Džeroski, S., Lavrač, N., & Petković, M. (2021). ReliefE: Feature Ranking in High-dimensional Spaces via Manifold Embeddings. Machine Learning. https://doi.org/10.1007/s10994-021-05998-5