License: arXiv.org perpetual non-exclusive license
arXiv:2310.17544v2 [cs.LG] 21 Jan 2024

Hierarchical Ensemble-Based Feature Selection for Time Series Forecasting

Aysin Tumay [1], Mustafa E. Aydin, Ali T. Koc, Suleyman S. Kozat

[1] Department of Electrical and Electronics Engineering, Bilkent University, Ankara, 06800, Turkey

Abstract

We introduce a novel ensemble approach for feature selection based on hierarchical stacking for settings with non-stationarity and/or a limited number of samples relative to a large number of features. Our approach exploits the co-dependency between features using a hierarchical structure. Initially, a machine learning model is trained using a subset of features, and then the output of the model is updated using other algorithms in a hierarchical manner with the remaining features to minimize the target loss. This hierarchical structure allows for flexible depth and feature selection. By exploiting feature co-dependency hierarchically, our proposed approach overcomes the limitations of traditional feature selection methods and feature importance scores. The effectiveness of the approach is demonstrated on synthetic and well-known real-life datasets, providing significant, scalable, and stable performance improvements compared to the traditional methods and the state-of-the-art approaches. We also provide the source code of our approach to facilitate further research and replicability of our results.

keywords:
feature selection, ensemble learning, the curse of dimensionality, hierarchical stacking, light gradient boosting machine (LightGBM), time series forecasting.

1 Introduction

We study feature selection for time series regression/prediction/forecasting tasks for settings where the number of features is large compared to the number of samples. This problem is extensively studied in the machine learning literature as it relates to the infamous “curse of dimensionality” phenomenon, which suggests that machine learning models tend to struggle in cases where the number of samples is not sufficient given the number of features for effective learning from data [2, 30]. This results in over-training to obtain a model with high variance, i.e., low generalization ability [11]. This feature selection problem is even more prominent in non-stationary environments, e.g., for time series data or drifting statistics, where the trend, or the relationship between the features and the desired output changes significantly over time, making it challenging to identify relevant features.

Generally, the problem is addressed by i) considering all subsets of the features, as in wrappers [16], ii) using feature characteristics, as in filters [23], iii) embedding feature selection into the model learning process, as in Lasso Linear Regression [28] and Random Forest [6], and iv) feature extraction methods such as Principal Component Analysis (PCA) [22]. However, these methods do not effectively utilize the information in the data, e.g., they simply exploit the “dominant” features without exploring others, and/or they do not scale as the number of dimensions grows, i.e., they are computationally inefficient or not dynamic enough to adapt as the domain of the features varies with the task. These widely used methods are also univariate, considering each feature once in calculating its importance. Directly evaluating all subsets of the features for given data becomes an NP-hard problem as the number of features grows, rendering such techniques, e.g., wrappers, computationally infeasible since the number of subsets exceeds a billion once the number of features reaches thirty. Embedding the selection methodology into modeling, e.g., feature selection based on feature importance scores, is also inadequate since, with limited data, these scores are unreliable and give vague explanations about gain- or split-based selection for tree-based models [20]. One possible solution is to use ensemble or bagging techniques, where different machine learning algorithms are trained on different subsets of the feature vectors. However, this approach also loses co-dependency information between features. Lastly, unsupervised feature extraction techniques such as PCA do not incorporate valuable information from the underlying task, which, be it regression or classification, is supervised.

Here, we introduce a highly effective and versatile hierarchical stacking-based novel ensembling approach to this problem, where we first train an initial machine learning model using a subset of the full features and then a second machine learning model in the hierarchy takes the remaining features as inputs using only the outcome of the first model on the cost function. The hierarchy grows until the features are exhausted and/or a user-controlled hierarchy depth is reached. Therefore, this generic structure allows the depth of the hierarchy, as well as the features used in each layer, to be design parameters. By exploiting the co-dependency between features in a hierarchical manner, our approach addresses the limitations of traditional feature selection methods and provides more reliable results than feature importance scores. We build upon the substantial work demonstrated in the ensemble feature selection domain and illustrate the success of our approach on synthetic and well-known real-world datasets in terms of accuracy, robustness, and scalability.

2 Related Work

Filters [8] and wrappers [16] are the prevalent traditional feature selection methods. While the former utilize statistical tests such as chi-square, information gain, and mutual information, wrappers recursively apply forward selection or backward elimination to the current feature subset according to an evaluation metric assigned to the model. Forward selection has a hard time finding good co-predictors, but it is faster than backward elimination and therefore scales better to larger datasets. In situations where wrappers overfit, filters are used with the knowledge of statistical tests [8]. Filters are fast to compute, easy to scale to higher-dimensional datasets, and independent of the model, but they ignore the dependencies between features. Wrappers, on the other hand, interact with the model and capture the feature dependencies, but they tend to require more computational resources than filters since trying all feature subsets is tedious. Saeys et al. [25] suggest more advanced methods such as an ensemble of feature selection methods and deep learning for feature selection. Moreover, Hancer [13] proposes a wrapper-filter feature selection method using fuzzy mutual information that overcomes standard mutual information’s limitations. Our approach differs from wrappers, filters, and the methods of Saeys et al. and Hancer since we propose a multivariate solution that processes groups of features by leveraging the co-dependency within these groups.

Boosting methods in feature selection as suggested by Das [8] are equipped with boosted decision trees where the metric is information gain and the weak learners are decision stumps. They perform well by increasing the weight of each high-loss decision stump. This paper also proposes a hybrid method that uses a filter method for initial feature ranking and selection, followed by a wrapper method that evaluates the selected features using a classification model, combining the high accuracy of wrappers and the time efficiency of filters [25]. Even though the hybrid model of wrappers and filters generates a more efficient model, our method is more time-efficient and utilizes the statistical importance of features as well as the prior knowledge of the side information. On top of these, gradient boosting models are not generic for every loss function since these models require the Hessian of the loss function to be nonzero. Our approach can integrate any external loss function in the middle of the system, independent of the boosting algorithm, which brings a novel solution to the problem.

The methods such as feature importance extracted by SVM [7] or Random Forest [6] algorithms and unsupervised feature extraction methods are not task-specific and might be inadequate to exploit domain knowledge in many areas. As for leveraging the ensemble approach, Saeys et al. [24] investigate the impact of four different techniques of feature selection namely Symmetric Uncertainty [19] for univariate cases, RELIEF [29] for multivariate cases, feature importance measures of Random Forest and SVMs, and finally SVM-based Recursive Feature Elimination [26]. Besides, the ensembles from a bagging point of view are made with perturbations for each technique. The classification performance and robustness are merged into an F1 measure with a custom weight which results in choosing an ensemble of Random Forest [24]. While feature importance scores of tree-based models bring insight into the dataset, they ignore the correlations between features since each feature is individually analyzed. As an improvement on the RELIEF algorithm, Škrlj et al. [33] developed an approach of ReliefE that suits high-dimensional sparse input spaces. Their solution lies in adapting manifold-based embeddings of feature and target space to multi-class classification problems. Although this method proposes a context-dependent space complexity, we enable a determined smaller complexity depending on the size of the input data and the number of layers in the iterated algorithm.

One of the commonly used feature selection methods is a tree ensemble of Random Forest to determine a threshold to slice the feature subset based on the average information gain of each node [3]. There are also threshold determination methods using data complexity measures. Seijo-Pardo et al. discuss an automatic method to determine the threshold, unlike other methods that are task-specific [27]. This method takes the weighted average of the complexity measures such as the Maximum Fisher discriminant ratio (F1), the volume of overlap region (F2), maximum feature efficiency (F3), complexity measures (CF), and the percentage of features retained. After different ranking methods from filters and embedded models are combined with the min-combination method, which selects the minimum of the relevance values coming from each ranking [4], the threshold is determined with one of the complexity measures. Then, the classification is completed with the selected subset of features. Another design given in this work applies thresholding to each ranking method. Having the thresholded sets, the rest is the same as the first design. All in all, an automatic thresholding method performs better than fixed thresholding methods [27]. Although Random Forest with a varying thresholding mechanism may work better than an ordinary model, the issue of univariate feature importance scores, which we tackle thoroughly by keeping informative features together, is not solved in this proposal. Another novel method proposed by Fumagalli et al. [12] is incremental permutation feature importance (iPFI), which is an online variation of the batch permutation method offering two sampling strategies to calculate marginal feature distributions. However, the lack of human-grounded experiments is compensated in our simulations. Additionally, the algorithm does not consider co-dependency between feature subsets, indicating a univariate treatment.

The novel approach suggested by Jenul et al. [14] parallels our aim of incorporating domain knowledge as well as the dominant data itself. They employ a user-guided ensemble feature selector which includes the likelihood approach, prior weights from expert knowledge, and side constraints as regularization. Afterward, these components are combined with an optimization rule. Although our approach echoes this method in prior weights in expert knowledge and the optimization loop, we overcome the limitations of loss functions. We employ a flexible optimization loop which extends to any possible customized loss.

Overall, the contemporary literature addresses feature selection either in a disjoint manner, i.e., in a model/domain-independent fashion, or via favoring the seemingly dominant features while failing to explore the majority of the features. Unlike previous studies, our model, for the first time in the literature, fully exploits the co-dependency between features via a novel hierarchical ensemble-based approach. This allows for an adaptive, i.e., dynamic, feature selection, which is especially useful in nonstationary environments, e.g., in time series settings. The introduced architecture is generic, i.e., both the depth of the hierarchy and the base models employed are user-controlled. As shown in our simulations, the introduced model provides significant improvements in performance as well as scalability over the well-known real-life competition datasets compared to the traditional as well as the state-of-the-art feature selection approaches. We publicly share the implementation of our algorithm for model design, comparisons, and experimental reproducibility at https://github.com/aysintumay/hierarchical-feature-selection.

3 Problem Description

All vectors in this paper are column vectors in lowercase boldface type. Matrices are denoted by uppercase boldface letters. Specifically, $\boldsymbol{X}$ represents a matrix containing $x^{(k)}_t$, i.e., the $k^{th}$ sequence of vector $\boldsymbol{x}_t$, and $x^{(k)}$ in the $k^{th}$ column for each time $t$. $X_{i,j}$ denotes the element of $\boldsymbol{X}$ in the $i^{th}$ row and $j^{th}$ column. Ordinary transposes of $\boldsymbol{x}_t$ and $\boldsymbol{x}$ are denoted as $\boldsymbol{x}^T_t$ and $\boldsymbol{x}^T$, respectively. The mean and standard deviation of $x_t^{(k)}$, i.e., the $k^{th}$ dimension of $\boldsymbol{x}_t$, are denoted by $\bar{x}_t^{(k)}$ and $\sigma(x_t^{(k)})$, respectively. The covariance of a time series $\{y_t\}$ with its $k$ times delayed version is represented as $\gamma_y(t,t+k)=\operatorname{cov}(y_t,y_{t-k})$. The gamma function, having the property of $\Gamma(N)=(N-1)!$ for positive integers $N$, generalizes the concept of a factorial to real and complex numbers.

We study feature selection in time series prediction of a sequence $\{y_t\}$. We observe this target sequence along with a side information sequence (or feature vectors) $\boldsymbol{x}_t$, each of which is of size $M$. At each time $t$, given the past information $\{y_k\}$ and $\boldsymbol{x}_k$ for $k \leq t$, we produce the output $\hat{y}_{t+1}$. Hence, in this setting, our goal is to find the relationship

$$\hat{y}_{t+1} = F_t\big(\{y_1, y_2, \ldots, y_t\}, \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_t\}\big),$$

where $F_t$ is an unknown function of time, which models $\hat{y}_{t+1}$. We introduce a hierarchical nonlinear ensemble model for $F_t$ with an integrated, efficient feature selection procedure. Throughout the training of the model, we suffer the cumulative loss

$$L = \sum_{t=1}^{N} \ell(y_t, \hat{y}_t),$$

where $N$ is the number of data points and $\ell$ can be, for example, the squared error loss, i.e., $\ell(y_t, \hat{y}_t) = (y_t - \hat{y}_t)^2$.
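As a minimal sketch of the cumulative loss above, the following snippet sums a per-sample loss over a sequence of targets and predictions. The function names and the toy numbers are illustrative assumptions and do not come from the paper's released code.

```python
from typing import Callable, Sequence


def squared_error(y: float, y_hat: float) -> float:
    """Per-sample squared error, one possible choice for the loss l."""
    return (y - y_hat) ** 2


def cumulative_loss(
    targets: Sequence[float],
    predictions: Sequence[float],
    loss: Callable[[float, float], float] = squared_error,
) -> float:
    """Sum the per-sample loss over all N data points."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum(loss(y, y_hat) for y, y_hat in zip(targets, predictions))


# Toy values chosen only for illustration.
y = [1.0, 2.0, 3.0]
y_hat = [1.5, 2.0, 2.0]
total = cumulative_loss(y, y_hat)  # 0.25 + 0.0 + 1.0
print(total)
```

Passing the loss as an argument mirrors the text, where the squared error is only one example of $\ell$.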

To exemplify the problem and illustrate the significance of $\{y_{t-j}\}$ for some $j>0$ values, i.e., the lagged target sequence among the features, we give the example of wind energy production prediction [17]. This task provides information from $\{y_{t-j}\}$ and features generated from it, also called $y_t$-related sequences, and weather conditions. Weather-based features constitute the side information, indirectly related to $\{y_t\}$. As shown in the wind prediction literature review [17], the $y_t$-related features are the dominant features. The predictions of a model that utilizes both $y_t$-related features and weather information largely follow the long-term patterns of the former, i.e., the weather-based side information is hardly utilized, as illustrated in Section 5. Additionally, this model is prone to overfitting due to its high dimensionality [11]. We aim to effectively incorporate weather-based features into the model, as they can capture short-term abnormalities caused by $y_t$-related features. In this sense, our approach initially shapes the long-term patterns with dominant feature vectors. Then, it fine-tunes them with the side information through cost optimization, resulting in using the side information and generating a multivariate solution. Note that this phenomenon happens in most real-life applications, where, due to the dominance of certain features, the other features are hardly utilized.

Formally, at time step $t$, we have a matrix $\boldsymbol{X}$ with a length $N$ and dimensionality $M$, i.e., the $N$ data points of $\boldsymbol{X}$ lie in a Euclidean space of $M$ dimensions. Considering a hypersphere in this space, we generalize the distance between data points in the $M$-dimensional hypersphere as $d \approx 2 \cdot r$, where $r$ is the radius of the hypersphere. If $M$ approaches $N$, the volume of the hypersphere, $V_n(r)$, given by:

$$V_n(r) = \frac{\pi^{\frac{M}{2}}}{\Gamma\left(\frac{M}{2}+1\right)} r^{M} \tag{1}$$

increases exponentially, leading to a sparser data distribution characterized by a larger $d$ [1].
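For readers who want to explore Eq. (1) numerically, the sketch below evaluates the volume for a given dimension $M$ and radius $r$ using only the Python standard library. The function names and the radius used in the loop are arbitrary illustrative choices, and the logarithmic variant is included only because the direct form can overflow for large $M$.

```python
import math


def hypersphere_volume(M: int, r: float) -> float:
    """Evaluate Eq. (1): pi^(M/2) / Gamma(M/2 + 1) * r^M."""
    return math.pi ** (M / 2) / math.gamma(M / 2 + 1) * r ** M


def log_hypersphere_volume(M: int, r: float) -> float:
    """Natural log of Eq. (1), numerically safer for large M."""
    return (M / 2) * math.log(math.pi) - math.lgamma(M / 2 + 1) + M * math.log(r)


# r = 5.0 is an arbitrary illustrative radius, not a value from the paper.
for M in (2, 3, 10, 50):
    print(M, hypersphere_volume(M, r=5.0))
```

For $M=2$ and $M=3$ the function reduces to the familiar area $\pi r^2$ and volume $\frac{4}{3}\pi r^3$, which gives a quick sanity check of the formula.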

We also suffer from data sparsity due to non-stationarity, which is generally caused by trends and seasonality, while conducting time series forecasting for the target sequence $\{y_t\}$. Such trends and seasonality reduce the relative number of unique data points, which can lead to overfitting.

To this end, we propose a hierarchical ensemble-based feature selection method for the time series forecasting task to overcome overfitting and non-stationarity. We split $\boldsymbol{x}_t$ into $K$ non-intersecting feature subsets $\boldsymbol{s}_t^{(k)}$, $k=1,\ldots,K$, based on domain knowledge or using certain heuristics such as feature importance metrics, as demonstrated in our experiments. After this split, we train $K$ machine learning models, in a dependent manner, that take each $\boldsymbol{s}_t^{(k)}$ as input in a hierarchical order by optimizing a cost function after each model until the $K^{th}$ one. Before describing our approach, we discuss the well-known approaches to this problem for completeness in the following.

3.1 Common Approaches

There are many approaches for circumventing the curse of dimensionality such as wrapper-based, embedded, filtering, and ensemble methods. Let us denote a subset of the full feature vector $\boldsymbol{x}_t$ as $\boldsymbol{s}^{(k)}_t$, $k=1,\ldots,2^M$, i.e., given $M$ features, there exist $2^M$ different subsets of the feature set, which may include one or more features from $\boldsymbol{x}_t$.
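To make the $2^M$ growth concrete, the short sketch below enumerates every subset of a small feature set and computes the count for thirty features. The feature names are hypothetical and serve only as an illustration.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple


def all_subsets(features: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every subset of the feature set, from the empty set to the full set."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)


# Hypothetical feature names, used only for illustration.
features = ["lag_1", "lag_2", "wind_speed", "temperature"]
subsets = list(all_subsets(features))
print(len(subsets))  # 2 ** 4 subsets for M = 4

# With M = 30 features the count already exceeds one billion.
n_subsets_30 = 2 ** 30
print(n_subsets_30)  # 1073741824
```

Note that the count $2^M$ includes the empty subset, which a practical search would skip.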

3.1.1 Wrapper-based Methods

As a greedy method, the well-established wrappers have an optimization objective over the validation loss that finds the best-performing feature subset $\boldsymbol{s}^{*}_t$ as follows:

$$\boldsymbol{s}^{*}_t = \operatorname*{arg\,min}_{\boldsymbol{s}^{(k)}_t \subset \boldsymbol{x}_t} L_{val} = \operatorname*{arg\,min}_{\boldsymbol{s}^{(k)}_t \subset \boldsymbol{x}_t} \sum_{t'=t_1}^{t_2} \ell\big(y_{t'}, f_{\boldsymbol{w}}(y_{t'}, \boldsymbol{s}^{(k)}_{t'})\big) \tag{2}$$

subject to $k=1,\ldots,2^M$, where $\{y_t\}_{\{t_1:t_2\}}$ and $\{\boldsymbol{s}_t^{(k)}\}_{\{t_1:t_2\}}$ are the target sequence and the feature subset $\boldsymbol{s}_t^{(k)}$ of the validation set between $t_1, t_2 \in \mathbb{Z}$, $1 < t_1 < t_2 < t$, $f_{\boldsymbol{w}}$ is a machine learning model trained on the $k^{\text{th}}$ feature subset $\boldsymbol{s}^{(k)}_t$ with parameters $\boldsymbol{w}$, and $\ell$ is the loss function of the model. The algorithm can iterate through each feature subset $\boldsymbol{s}^{(k)}_t$, seeking to maximize the performance of the machine learning model on the validation set. Naturally, due to computational complexity issues, wrappers are rarely used in their exhaustive form in real-life applications.
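The exhaustive search behind Eq. (2) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the base model $f_{\boldsymbol{w}}$ is taken to be ordinary least squares with a bias term, the lagged-target argument of $f_{\boldsymbol{w}}$ is omitted, the data are synthetic, and NumPy is assumed to be available.

```python
from itertools import combinations

import numpy as np


def wrapper_select(X_train, y_train, X_val, y_val):
    """Try every non-empty feature subset and keep the one with the lowest validation loss."""
    n_features = X_train.shape[1]
    best_subset, best_loss = None, np.inf
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            cols = list(subset)
            # Assumed base model: least squares on the chosen columns plus a bias.
            A_train = np.column_stack([X_train[:, cols], np.ones(len(X_train))])
            w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
            A_val = np.column_stack([X_val[:, cols], np.ones(len(X_val))])
            # Squared error summed over the validation window, as in Eq. (2).
            loss = float(np.sum((y_val - A_val @ w) ** 2))
            if loss < best_loss:
                best_subset, best_loss = subset, loss
    return best_subset, best_loss


# Synthetic data: only features 0 and 2 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)
subset, loss = wrapper_select(X[:150], y[:150], X[150:], y[150:])
print(subset, loss)
```

The double loop visits $2^M - 1$ subsets, which is manageable for the five features here and quickly becomes infeasible as $M$ grows.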

3.1.2 Embedded Methods

As another approach in the literature, embedded methods perform feature selection as a part of the model construction process. Examples of these methods include Random Forests and Gradient Boosting Trees. Tree-based models eliminate features once or recursively (also known as Recursive Feature Elimination) based on their feature importance rankings and use the remaining ones for training. One drawback of this method is that it is univariate, considering one feature at a time while calculating the scores.

Figure 1: We have $K$ feature subsets used as inputs to $K$ base models (blue). Then, we combine base learners with $\boldsymbol{\alpha}_t^{(i)}$ for final prediction (pink).

3.1.3 Filtering Methods

Another traditional method of feature selection is filtering. Unlike other methods, filtering relies on statistical measures instead of using machine learning algorithms. For instance, the score of the Pearson Correlation Coefficient [21] between $x_{t'}^{(k)}$ and $\{y_{t'}\}$ for $t' \leq t$ is $S(x^{(k)}) = \text{cov}(x^{(k)}, y) / (\sigma(x^{(k)}) \cdot \sigma(y))$.

A filtering algorithm can incrementally form an optimal set $\boldsymbol{s}_t^{*}$ in terms of correlation, with $m$ feature vectors from $\boldsymbol{x}_t$, based on maximum dependency. This is accomplished by discarding the feature vector with the lowest correlation score from $\boldsymbol{x}_t$ in each iteration. On the other hand, filtering methods do not inherently incorporate domain knowledge and cannot measure nonlinear dependency since they rely solely on statistical measures. Moreover, they are univariate, calculating the score of each feature one by one.
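The filtering procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `pearson_scores` and `filter_select` are hypothetical, and the absolute value of the Pearson score is used for ranking, which the text does not specify.

```python
import numpy as np

def pearson_scores(X, y):
    """Pearson correlation S(x^(k)) between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / len(y)
    denom = X.std(axis=0) * y.std()
    # Constant columns have zero standard deviation; assign them a score of 0.
    return np.divide(cov, denom, out=np.zeros_like(cov, dtype=float), where=denom > 0)

def filter_select(X, y, m):
    """Drop the lowest-|score| feature in each iteration until m features remain."""
    keep = list(range(X.shape[1]))
    while len(keep) > m:
        scores = np.abs(pearson_scores(X[:, keep], y))
        keep.pop(int(np.argmin(scores)))
    return keep  # column indices of the retained features
```

Because each score is computed from one feature at a time, a feature such as $x$ paired with the target $x^2$ on a symmetric range receives a score near zero, which illustrates the nonlinearity limitation mentioned above.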

3.1.4 Ensemble-based Methods

In this method, the predictions of machine learning models, also called base learners, are directly used to determine the weight vector that ensembles the base learners, as shown in Figure 1. The version with two base learners is given in Algorithm 1. Each base learner takes a different feature subset vector as input. Combining the predictions of the $K$ base learners, denoted as $\tilde{y}_t^{(i)}$, $i = 1, \ldots, K$, the ensemble prediction is found as follows:

$$\tilde{y}_t^{E} = \boldsymbol{\alpha}_t^{T} \tilde{\boldsymbol{y}}_t, \tag{3}$$

where $\tilde{\boldsymbol{y}}_t = [\tilde{y}_t^{(1)}, \ldots, \tilde{y}_t^{(K)}]^{T}$ is the base prediction vector of the $K$ machine learning models, and $\boldsymbol{\alpha}_t \in \mathbb{R}^{K}$ is the ensembling coefficient vector. With affine-constrained optimization, the loss of the ensemble model is minimized subject to $\boldsymbol{\alpha}_{\boldsymbol{s}_t^{(i)}}^{T} \boldsymbol{1} = 1$:

$$\min_{\alpha_{\boldsymbol{s}_t^{(i)}} \in \mathbb{R}^{K}} \ell\Big(y_t, \sum_{i=1}^{K} \alpha_{\boldsymbol{s}_t^{(i)}}^{(i)} \, \tilde{y}_t^{(i)}\Big). \tag{4}$$

The optimization hyperspace is $(K-1)$-dimensional when the $K$th component of $\boldsymbol{\alpha}_t$ complements the sum of the entire vector to be 1. Therefore, the constraint of the minimization changes into $\alpha_{\boldsymbol{s}_t^{(i)}} = 1 - \boldsymbol{\alpha}_{\boldsymbol{s}_t^{(i)}}^{T} \boldsymbol{1}$. In this sense, conventional ensemble methods can be computationally expensive and, more significantly, since the base learners are trained independently, they cannot exploit the co-dependency between feature subsets.

Algorithm 1 Ensemble Model
$\tilde{y}_t^{(1)} = g(\boldsymbol{s}_t^{(1)}, y_t)$  $\triangleright$ Train base learner $g$ with $\boldsymbol{s}_t^{(1)}$.
$\tilde{y}_t^{(2)} = h(\boldsymbol{s}_t^{(2)}, y_t)$  $\triangleright$ Train base learner $h$ with $\boldsymbol{s}_t^{(2)}$.
for $t = 1$ to $N$ do  $\triangleright$ Iterate through each time step $t$.
    $\text{min\_loss} \leftarrow \infty$
    $\tilde{\alpha}_t \leftarrow 0$
    for $\alpha = 0$ to $1$ do
        $\tilde{y}_t = \alpha \cdot \tilde{y}_t^{(1)} + (1 - \alpha) \cdot \tilde{y}_t^{(2)}$
        $\text{loss} = \ell(y_t, \tilde{y}_t)$
        if $\text{loss} < \text{min\_loss}$ then
            $\text{min\_loss} = \text{loss}$
            $\tilde{\alpha}_t \leftarrow \alpha$
        end if
    end for
end for
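The weight search in Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch: the function name `ensemble_weights`, the default L1 loss, and the grid of 101 evenly spaced candidate values for $\alpha$ are assumptions, since the algorithm does not state a step size.

```python
import numpy as np

def ensemble_weights(y, pred1, pred2, loss=lambda a, b: abs(a - b), num_alphas=101):
    """Per-time-step search for the weight alpha in [0, 1] that mixes two base learners."""
    alphas = np.linspace(0.0, 1.0, num_alphas)  # assumed candidate grid
    best = np.zeros(len(y))
    for t in range(len(y)):
        min_loss = np.inf
        for alpha in alphas:
            # Affine combination: the two weights sum to 1.
            y_tilde = alpha * pred1[t] + (1 - alpha) * pred2[t]
            current = loss(y[t], y_tilde)
            if current < min_loss:
                min_loss, best[t] = current, alpha
    return best
```

With two base learners the affine constraint leaves a single free weight, so the search is one-dimensional; with $K$ learners the same grid idea would cover a $(K-1)$-dimensional space, which is where the computational cost noted above comes from.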

4 Hierarchical Ensemble-Based Approach

To overcome the limitations of traditional feature selection methods, we propose a hierarchical ensemble-based approach involving $K$ distinct machine learning models organized into $K$ levels. Figure 2 illustrates two successive sample layers of the structure. Each machine learning model takes the output of the previous layer ($f^{(i-1)}$ in Figure 2). Then, the latter model ($f^{(i)}$ in Figure 2) generates predictions of the optimized weights that scale the last output. Each model operates on a different subset of $\boldsymbol{x}_t$, which may include exogenous information, also called side information, and features derived from the past information of the target sequence $\{y_t\}$, i.e., $\{y_{t-j}\}$, $j = 1, \ldots, N-1$, denoted as $\boldsymbol{s}_t(y)$.

While the hierarchical ordering of the models is guided by domain expertise, we propose that the first level should exclusively comprise $\{y_{t-j}\}$ (and features derived from them, e.g., their rolling means), referred to as $y_t$-related features, as shown in different time series prediction papers [32, 18]. The reason is that these past target values exhibit higher importance scores than the side information sequences, thereby dominating the predictions. In fact, many machine learning-based time series models suffer from “overfitting” to the $y_t$-related features and ignore most of the other features [10]. Subsequent levels can incorporate the side information sequences $\boldsymbol{s}_t^{(k)} \subset \{\boldsymbol{x}_t\} \setminus \boldsymbol{s}_t(y)$ for $k = 1, \ldots, K-1$. We next describe the layers in the hierarchy and the optimization procedure thereof.
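As a small illustration of the $y_t$-related features described above, the following sketch builds lagged values and a rolling mean from past targets only. The function name `y_related_features`, the chosen lags, and the window length are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def y_related_features(y, lags=(1, 2), window=3):
    """Build features for time t from past targets: lagged values and a rolling mean."""
    y = np.asarray(y, dtype=float)
    start = max(max(lags), window)  # first t for which every feature is defined
    rows = []
    for t in range(start, len(y)):
        lagged = [y[t - j] for j in lags]
        rolling_mean = y[t - window:t].mean()  # uses only values strictly before t
        rows.append(lagged + [rolling_mean])
    return np.array(rows), y[start:]  # feature matrix and aligned targets
```

For the sequence 1, 2, 3, 4, 5 with the defaults, the first row is built for the target 4 and contains the lags 3 and 2 and the rolling mean 2.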

Figure 2: Two feature subsets are input to two different models in a hierarchical order. The first layer (orange) takes the $y_t$-related features as input. In the next step, $\alpha_t^{(i)}$ is generated with cost optimization. Then, the second layer (pink) predicts $\alpha_t^{(i)}$. Finally, the second-layer predictions (green) are generated by combining $\tilde{\alpha}_t^{(i)}$ and $\tilde{y}_t^{(i)}$.

4.1 Description of Layers

To explicate the layers composing the introduced architecture, Figure 2 presents a snapshot of the generic model with all the major components in action; these represent the core operations in the overall hierarchy, which are repeated in succession. In Figure 2, we observe four main components in the transition from layer $i-1$ to layer $i$: two machine learning models (left and middle right), a cost optimization function (middle left), and a linear superposition function (right). The leftmost machine learning model, $f^{(i-1)}$, is fed with the $(i-1)$th restricted side information sequence $\boldsymbol{s}_t^{(i-1)}$, which corresponds to the “feature bagging” technique for addressing the bias-variance trade-off [6], as well as with the refined predictions of the previous layer, which we elaborate on in the following. There is no restriction on what this learning model can be, nor on the loss function it aims to minimize. After the learning is finished, the predictions of the model are acquired and passed on to the cost optimization block (middle left part of Figure 2). Therein lies the novelty of our algorithm: unlike the usual boosting procedure used by, e.g., LightGBM, we do not transmit these predictions as is to the next model in the chain but instead subject them to a weighting.
To minimize the final loss, the cost optimization function $g$ generates a weight sequence $\alpha_t^{(i)}$ that scales the previous prediction sequence, $\tilde{y}_t^{(i-1)}$. Another novelty of this method is that any loss function is compatible with the cost optimization stage, which removes the restriction to loss functions with practical second derivatives; the L1 loss, for example, has no useful second derivative. The details of the optimization of the $\alpha_t^{(i)}$ values are given in Section 4.2.

For context-awareness, we further feed these optimized $\alpha_t^{(i)}$ values into another learning model, $f^{(i)}$, which uses the current context $\boldsymbol{s}_t^{(i)}$ for its training; we acquire the context-aware weights $\tilde{\alpha}_t^{(i)}$ from it. Then, in the last superposition stage, a linear function $h$ scales the prediction of the leftmost model $f^{(i-1)}$ with the $\tilde{\alpha}_t^{(i)}$ values to obtain the refined predictions, which are then fed to the next block in the series. The chain of blocks continues this way until $i$ reaches the user-defined hierarchy size parameter $K$. This flow of the model is depicted in Algorithm 2.

Formally, in the transition from the $(i-1)$th to the $i$th layer of the algorithm, we first generate the $(i-1)$th model’s predictions for each time $t$, denoted as $\tilde{y}_t^{(i-1)}$, where $i = 2, \ldots, K$, by inputting $\boldsymbol{s}_t^{(i-1)}$ into base learner $f^{(i-1)}$. Subsequently, we deduce the weight sequence $\alpha_t^{(i)}$ to refine $\tilde{y}_t^{(i-1)}$ by optimizing iteratively in a loop, as depicted in Algorithm 2, for:

$$\hat{y}_t^{(i)} = \alpha_t^{(i)} \tilde{y}_t^{(i)}, \tag{5}$$

where $\alpha_t^{(i)} \in [0, 2]$ for each time $t$. As depicted in the middle left pink block in Figure 2, we determine $\alpha_t^{(i)}$ with an optimization loop to minimize the loss at each time $t$ in layer $i$ as

$$L_t = \ell(y_t, \hat{y}_t^{(i)}) = \ell(y_t, \alpha_t^{(i)} \tilde{y}_t^{(i)}). \tag{6}$$

In general, $\ell$ need not be differentiable. Unlike ensemble methods, we optimize $\ell$ with $\tilde{y}_t^{(i)}$ and $\tilde{y}_t^{(i-1)}$ by scaling $\tilde{y}_t^{(i-1)}$ at each time $t$ with $\alpha_t^{(i)}$. The cost optimization step is further elaborated in Section 4.2. In this sense, in the $i$th layer, we update the output of the $(i-1)$th layer using the features that belong to the $i$th layer. Therefore, every subset of features contributes to minimizing the final error, similar to boosting or stacking.

As shown in the middle right block in Figure 2, we train the subsequent model ($f^{(i)}$ in Figure 2) with $\alpha_t^{(i)}$ and $\boldsymbol{s}_t^{(i)}$ to predict the ultimate weights $\tilde{\alpha}_t^{(i)}$, which are context-aware. In the green block at the end, we modify $\tilde{y}_t^{(i)}$ using $\tilde{\alpha}_t^{(i)}$ to incorporate the side information, leveraging the patterns learned from the error of the previous layer, with a linear superposition function $h$, as represented by

$$\tilde{y}_t^{(i+1)} = h(\tilde{y}_t^{(i)}, \tilde{\alpha}_t^{(i)}) = \tilde{\alpha}_t^{(i)} \tilde{y}_t^{(i)}. \tag{7}$$

In the following section, the cost optimization step between layers 1 and 2 in Algorithm 2 is elaborated.

Algorithm 2 Hierarchical Ensemble-Based Method
$\boldsymbol{s}_t^{(i-1)} \leftarrow \boldsymbol{s}_t(y) \in \boldsymbol{X}$, where $t = 1, 2, \dots, N$, $i = 2, 3, \dots, K+1$
$\tilde{y}_t^{(i)} \leftarrow f^{(i-1)}(\boldsymbol{s}_t^{(i-1)}, \tilde{y}^{(i-1)})$
procedure Cost Optimization
    for $t = 1, 2, \dots, N$ do
        $min\_loss \leftarrow \infty$
        $range = [1 - \beta, 1 + \beta]$, where $\beta \in [0, 1] \subset \mathbb{R}$
        for $\alpha \leftarrow 1 - \beta$ to $1 + \beta$ do
            $\hat{y}_t^{(i)} = \alpha \tilde{y}_t^{(i)}$
            $loss = \ell(y_t, \hat{y}_t^{(i)})$
            if $loss < min\_loss$ then
                $min\_loss = loss$
                $\alpha_t^{(i)} = \alpha$
            end if
        end for
    end for
end procedure
$\boldsymbol{s}_t^{(i)} \leftarrow \boldsymbol{s}_t^{(i)} \subset \{\boldsymbol{x}_t\} \setminus \boldsymbol{s}_t(y)$
$\tilde{\alpha}_t^{(i)} \leftarrow f^{(i)}(\boldsymbol{s}_t^{(i)}, \alpha_t^{(i)})$
$\tilde{y}_t^{(i+1)} \leftarrow \tilde{\alpha}_t^{(i)} \tilde{y}^{(i)}$
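One transition of Algorithm 2 can be sketched in Python as follows. This is a hedged illustration and not the authors' released code: the helper names `fit_linear`, `optimal_alphas`, and `hierarchical_layer` are hypothetical, a least-squares linear model stands in for the arbitrary learners $f^{(i-1)}$ and $f^{(i)}$, the candidate grid of 201 values is an assumption, and only the first transition is covered, so no refined predictions from an earlier layer are fed in. The loss is assumed to be an elementwise function that supports NumPy broadcasting.

```python
import numpy as np

def fit_linear(X, target):
    """Least-squares linear model with intercept (stand-in for any learner)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return lambda Z: np.column_stack([Z, np.ones(len(Z))]) @ coef

def optimal_alphas(y, y_tilde, loss, beta=1.0, num_alphas=201):
    """Cost optimization: per-time-step grid search over [1 - beta, 1 + beta]."""
    grid = np.linspace(1 - beta, 1 + beta, num_alphas)
    # losses[t, j] = loss(y[t], grid[j] * y_tilde[t])
    losses = loss(y[:, None], grid[None, :] * y_tilde[:, None])
    return grid[np.argmin(losses, axis=1)]

def hierarchical_layer(S_prev, S_cur, y, loss, beta=1.0):
    """One layer transition: base prediction, weight search, weight model, rescaling."""
    y_tilde = fit_linear(S_prev, y)(S_prev)             # base learner on y-related features
    alpha = optimal_alphas(y, y_tilde, loss, beta)      # optimized weights
    alpha_tilde = fit_linear(S_cur, alpha)(S_cur)       # context-aware weights from side info
    return alpha_tilde * y_tilde, alpha, alpha_tilde    # refined predictions, as in Eq. (7)
```

The sketch evaluates everything in-sample for brevity; in a forecasting setting the two models would be fit on past data and the predicted weights applied to future base predictions.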

4.2 Cost Optimization

Cost optimization refers to the iterative approach in which we draw $\alpha_t^{(i)}$ from a predetermined range to modify $\tilde{y}_t^{(i)}$ in any layer $i$. We aim to find $\alpha_t^{(i)} \in \mathbb{R}$ for each time $t$. We thus have the following optimization objective:

$$\min_{\alpha_t^{(i)} \in \mathbb{R}} \ell\big(y_t, \alpha_t^{(i)} \, \tilde{y}_t^{(i)}\big), \tag{8}$$

where $\ell$ can be any loss, subject to $\alpha_t^{(i)} \in [0, 2]$ for each time $t$: a weight in the range $[0, 1]$ effectively downscales the prediction, and a weight in the range $[1, 2]$ upscales it, which improves robustness. This procedure aims to be flexible so that the algorithm can be adapted to any domain-specific loss function. The “Cost Optimization” procedure in Algorithm 2 and the leftmost pink block in Figure 2 show the structure of the cost function.
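As a worked illustration of the objective above, consider the special case of the L1 loss with a nonzero prediction. This case is chosen here only as an example and is not singled out by the method, which uses a search loop and accepts any loss. Since $|y_t - \alpha \tilde{y}_t^{(i)}|$ is convex in $\alpha$, the constrained minimizer is the ratio $y_t / \tilde{y}_t^{(i)}$ clipped to $[0, 2]$; the numerical values below are made up for illustration.

```latex
\begin{align*}
\alpha_t^{\star} &= \arg\min_{\alpha \in [0,2]} \left| y_t - \alpha\,\tilde{y}_t^{(i)} \right|
  = \min\!\left(2,\ \max\!\left(0,\ \frac{y_t}{\tilde{y}_t^{(i)}}\right)\right),
  \qquad \tilde{y}_t^{(i)} \neq 0, \\[4pt]
y_t = 12,\ \tilde{y}_t^{(i)} = 10 &:\quad \alpha_t^{\star} = 1.2, \quad \ell = |12 - 1.2 \cdot 10| = 0, \\
y_t = 25,\ \tilde{y}_t^{(i)} = 10 &:\quad \alpha_t^{\star} = 2, \quad \ell = |25 - 2 \cdot 10| = 5, \\
y_t = -3,\ \tilde{y}_t^{(i)} = 10 &:\quad \alpha_t^{\star} = 0, \quad \ell = |-3 - 0 \cdot 10| = 3.
\end{align*}
```

The first case is an upscaling that removes the error entirely, while the last two show the range constraint binding at its upper and lower ends. A grid search such as the loop in Algorithm 2 reaches these values up to the grid resolution.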

To understand the contribution of our cost optimization approach, we compare it with the loss structure of a powerful tree-based model, LightGBM [15]. During the training process, LightGBM iteratively updates the model by minimizing the chosen cost function. This process is typically performed using gradient-boosting techniques. With a decision tree as the base model of LightGBM, the leaf split finding operation relies on the high-level insight provided by the gradient and Hessian of the loss function. The gradient is defined as

$$g_t(x) = E_y\left[\frac{\partial \ell(y_t, f(x_t, \theta, y_t))}{\partial f(x_t, \theta, y_t)} \,\middle|\, x\right]_{f(x_t, \theta, y_t) = \hat{f}_{t-1}(x_t, \theta, y_t)}, \tag{9}$$

and the hessian is defined as

$$h_t(x) = E_y\left[\frac{\partial^2 \ell(y_t, f(x_t,\theta,y_t))}{\partial f(x_t,\theta,y_t)^2} \,\Big|\, x\right]_{f(x_t,\theta,y_t)=\hat{f}_{t-1}(x_t,\theta,y_t)}. \tag{10}$$

If we used the customized loss function of LightGBM, it would calculate the gradient and hessian of $\ell$ with respect to $\tilde{y}_t^{(i)}$ to determine the direction and magnitude of the updates. As an example, the negative gradient for L1 loss is given by

$$g_t^{(i)} = \frac{\partial \ell(y_t, \alpha_t \tilde{y}_t^{(i)})}{\partial \tilde{y}_t^{(i)}} = \alpha_t, \tag{11}$$

which leads to the hessian being impractically 0, in a form that LightGBM cannot natively process. Therefore, the hessian of the loss function should be nonzero while working with a custom function.
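
To make the zero-hessian issue concrete, the following is a minimal sketch of what a custom objective for the scaled L1 loss would return. It assumes the common convention in which a custom objective returns the per-sample gradient and hessian with respect to the prediction; the function name `scaled_l1_objective` and the fixed `alpha` argument are illustrative and not taken from the paper. The exact derivative carries a sign factor, so the $\alpha_t$ in (11) corresponds to its magnitude.

```python
import numpy as np

def scaled_l1_objective(y_true, y_pred, alpha):
    """Gradient and hessian of |y - alpha * y_pred| with respect to y_pred.

    Illustrative only: `alpha` is treated as a fixed weight here.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    alpha = np.broadcast_to(np.asarray(alpha, dtype=float), y_pred.shape)
    residual = y_true - alpha * y_pred
    # d/dy_pred |y - alpha * y_pred| = -alpha * sign(residual),
    # so the magnitude of the gradient is alpha wherever residual != 0.
    grad = -alpha * np.sign(residual)
    # The gradient is piecewise constant, hence the second derivative
    # is zero everywhere it is defined.
    hess = np.zeros_like(y_pred)
    return grad, hess

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([0.5, 2.5, 3.5])
grad, hess = scaled_l1_objective(y, y_hat, alpha=1.2)
print(grad)
print(hess)
```

Because every entry of `hess` is zero, a second-order leaf update that divides by the summed hessian has no curvature information to work with, which is the limitation described above.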

In our application, we take a different approach that makes it convenient to embed any custom loss into our base learners, e.g., LightGBM. We define a value $\beta \in \mathbb{R}$, a tuned hyperparameter in the range $[0,1]$, such that $1+\beta$ is the upper limit and $1-\beta$ is the lower limit of the chosen $\alpha_t^{(i)}$ value. As depicted in Algorithm 2, we find the optimal $\alpha_t^{(i)}$ in a greedy process that directly minimizes any custom loss function for each time step $t$. Therefore, the objective in (6) is employed. After this step, the $\tilde{\alpha}_t^{(i)}$ vector is fed to the following model, depicted as $f^{(i)}$ in Figure 2.
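
The greedy search for $\alpha_t^{(i)}$ can be sketched as follows. This is a minimal illustration that assumes a uniform candidate grid over $[1-\beta, 1+\beta]$; the grid resolution, the function name `greedy_alpha`, and the example loss are assumptions, not details taken from Algorithm 2.

```python
import numpy as np

def greedy_alpha(y_true, y_prev, beta, loss, n_grid=101):
    """For each time step, pick the alpha in [1 - beta, 1 + beta] that
    minimizes loss(y_t, alpha * y_prev_t). The grid search is an assumption."""
    y_true = np.asarray(y_true, dtype=float)
    y_prev = np.asarray(y_prev, dtype=float)
    candidates = np.linspace(1.0 - beta, 1.0 + beta, n_grid)
    # Rows index time steps, columns index candidate alphas.
    scaled = y_prev[:, None] * candidates[None, :]
    losses = loss(y_true[:, None], scaled)
    best = np.argmin(losses, axis=1)
    return candidates[best]

def l1_loss(y, y_hat):
    # Any elementwise loss with this signature can be plugged in.
    return np.abs(y - y_hat)

y = np.array([1.0, 2.0, 3.0, 4.0])
y_prev = np.array([1.0, 1.6, 3.0, 10.0])  # toy previous-level predictions
alpha = greedy_alpha(y, y_prev, beta=0.5, loss=l1_loss)
print(alpha)
```

In this toy run, the last step saturates at the lower limit $1-\beta$ because the unconstrained optimum (0.4) lies outside the allowed range. Since the search only evaluates the loss, no gradient or hessian of the loss is required.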

By iterating the process in Algorithm 2, we can effectively search for the best combination of features and capture the correlation between them. To improve on traditional feature selection methods, the iterative process allows us to optimize $K$ models simultaneously, as the predicted output of one model is used to enhance the next model.
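
One plausible reading of this iteration for $K=2$ is sketched below. It is an illustration under stated assumptions, not the authors' implementation: a least-squares fit stands in for the base learners, the second model is assumed to predict the per-step weight from the remaining features, the squared error replaces the custom loss, the evaluation is in-sample, and all data and names (`fit_linear`, `best_alpha`) are made up for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares stand-in for a base learner (hypothetical helper)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.column_stack([Z, np.ones(len(Z))]) @ coef

def best_alpha(y, y_prev, beta, n_grid=67):
    """Per-step alpha in [1 - beta, 1 + beta] minimizing the squared error."""
    grid = np.linspace(1.0 - beta, 1.0 + beta, n_grid)
    err = (y[:, None] - y_prev[:, None] * grid[None, :]) ** 2
    return grid[np.argmin(err, axis=1)]

rng = np.random.default_rng(0)
n = 400
s1 = rng.normal(size=(n, 3))                         # first feature subset
s2 = rng.integers(0, 2, size=(n, 1)).astype(float)   # second subset: a binary flag
base = 5.0 + s1 @ np.array([1.0, 0.5, -0.5])
y = base * np.where(s2[:, 0] == 1, 1.33, 0.66)       # flag rescales the target

# Level 1: fit on the first subset only.
f1 = fit_linear(s1, y)
y1 = f1(s1)

# Cost optimization: per-step weights that best rescale the level-1 output.
alpha_tilde = best_alpha(y, y1, beta=0.33)

# Level 2: learn the weights from the remaining features, then rescale.
f2 = fit_linear(s2, alpha_tilde)
y2 = f2(s2) * y1

mse1 = np.mean((y - y1) ** 2)
mse2 = np.mean((y - y2) ** 2)
print(mse1, mse2)
```

In this toy setting the second subset only carries the up/down scaling, so the second level can correct the first-level output even though neither subset explains the target on its own.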

In the next section, we illustrate the performance of our hierarchical ensemble-based model on synthetic and widely known real-life datasets.

5 Simulations

In this section, we illustrate the performance of our hierarchical ensemble-based model with 2 layers, i.e., $K=2$, in comparison with other models on various well-known time series datasets. First, we introduce the models used for comparison. Then, we report the performance of our model.

5.1 Compared Models

The simulations include 5 models, labeled Wrapper, Ensemble, Hierarchical Ensemble, Embedded, and Baseline LightGBM. The first compared method, Wrapper, described in Section 3, discards in each iteration the feature that contributes least to the model based on the L2 loss. The Embedded model, described in Section 3, uses only $\{y_{t-j}\}$ and features derived from $\{y_{t-j}\}$, e.g., rolling features, denoted $\boldsymbol{s}_t(y)$. The $y_t$-related features are given as input to a separate model in the simulation to investigate whether the most important features suffice without requiring domain knowledge. The Ensemble model works based on Algorithm 1 with 2 baseline models, mixing $\boldsymbol{s}_t(y)$ and $\{\boldsymbol{x}_t\} \setminus \{\boldsymbol{s}_t(y)\}$ with $\alpha_t^{(i)}$, which is chosen in an iterative process minimizing the L1 objective. Lastly, the Baseline LightGBM model uses the whole $\boldsymbol{x}_t$. With this model, we examine whether feature selection is necessary at all for these datasets. Moreover, we expect to observe overfitting due to the high dimensionality and non-stationarity.

The evaluation metric used in the experiments is the mean square error. All experiments are repeated 200 times to ensure the reliability of the results. The synthetic dataset is generated 200 times with random Gaussian noise. For the real-life dataset, we randomly sampled 200 out of the 414 series of the hourly M4 Forecasting dataset, a widely publicized competition dataset [31]. We computed the cumulative error between $\{\tilde{y}_t\}$ and $\{y_t\}$ in three steps. For the $j^{th}$ experiment, the squared error term at time $t$ is

$$l_t^{(j)} = (y_t^{(j)} - \tilde{y}_t^{(j)})^2 / N. \tag{12}$$

Then, the average over the 200 trials is taken to eliminate the bias due to using a particular sequence:

$$\bar{l}_t = \frac{\sum_{j=1}^{200} l_t^{(j)}}{200}. \tag{13}$$

Finally, the cumulative sum over time, normalized by $t$, is taken to smooth the results as follows:

$$\bar{l}^{(ave)} = \frac{\sum_{t'=1}^{t} \bar{l}_{t'}}{t}. \tag{14}$$
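
The three steps in (12)-(14) can be sketched in a few lines of NumPy. This is a minimal illustration: the function name `cumulative_average_loss` is hypothetical, the argument `n_norm` stands for the $N$ in (12) and is taken as an input, and the number of trials is read from the array shape rather than fixed to 200.

```python
import numpy as np

def cumulative_average_loss(y_true, y_pred, n_norm=1.0):
    """y_true, y_pred: arrays of shape (n_trials, n_steps).

    Returns the running average of the trial-averaged squared error.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_step = (y_true - y_pred) ** 2 / n_norm   # squared error term, as in (12)
    trial_mean = per_step.mean(axis=0)           # average over trials, as in (13)
    steps = np.arange(1, trial_mean.size + 1)
    return np.cumsum(trial_mean) / steps         # running average over time, as in (14)

# Two toy trials with three time steps each.
y_true = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
y_pred = np.array([[1.0, 1.0, 3.0], [3.0, 2.0, 3.0]])
curve = cumulative_average_loss(y_true, y_pred)
print(curve)
```

Dividing the cumulative sum by $t$ makes the curve a running mean, so a flat or descending curve indicates that later errors are no larger than earlier ones.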

Regarding the experiment settings, the $\beta$ hyperparameter in the cost optimization step depicted in Algorithm 2 is fixed to 0.33 for both experiments. Therefore, the range of $\alpha_t^{(i)}$ is $[0.66, 1.33]$.

In the next sections, we present the analysis and experiments on the synthetic and real-life datasets. As we simulate with $K=2$ models, we denote $\boldsymbol{s}_t^{(i-1)}$ as $\boldsymbol{s}_t^{(1)}$ and, likewise, $\alpha_t^{(i)}$ as $\alpha_t$.

5.2 Synthetic Dataset

The data is generated with an autoregressive moving average (ARMA) process of order $(4,5)$, i.e.,

$$y_t = \sum_{i=1}^{4} \Phi_i y_{(t-i)} + \sum_{j=1}^{5} \theta_j \epsilon_{(t-j)} + \epsilon_t, \tag{15}$$

where the autoregressive part is represented by the lagged values up to lag 4, and the moving average part is represented by the lagged error terms up to lag 5. The $\Phi$ and $\theta$ variables control the strength of the lags [5]. In our setting, $\Phi = [0.4, 0.3, 0.2, 0.1]$ and $\theta = [0.65, 0.35, 0.3, -0.15, -0.3]$. The Augmented Dickey-Fuller [9] test yields a p-value of 0.2116, indicating non-stationarity. Then, the series is transformed with a min-max scaler.
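
A minimal sketch of generating a series from (15) with the stated coefficients is given below. The innovation standard deviation, the zero initial conditions, the seed, and the helper names `simulate_arma` and `min_max_scale` are assumptions made for the illustration; the length of 500 follows the sample size in Table 1.

```python
import numpy as np

def simulate_arma(n, phi, theta, noise_std=1.0, seed=0):
    """Simulate y_t = sum_i phi_i * y_{t-i} + sum_j theta_j * eps_{t-j} + eps_t."""
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi, dtype=float)
    theta = np.asarray(theta, dtype=float)
    p, q = len(phi), len(theta)
    start = max(p, q)                       # zero-padded warm-up (assumption)
    eps = rng.normal(0.0, noise_std, size=n + start)
    y = np.zeros(n + start)
    for t in range(start, n + start):
        # Reverse the slices so that index 0 is lag 1, index 1 is lag 2, ...
        ar = np.dot(phi, y[t - p:t][::-1])
        ma = np.dot(theta, eps[t - q:t][::-1])
        y[t] = ar + ma + eps[t]
    return y[start:]

def min_max_scale(x):
    """Rescale a series to the interval [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

phi = [0.4, 0.3, 0.2, 0.1]
theta = [0.65, 0.35, 0.3, -0.15, -0.3]
y = min_max_scale(simulate_arma(500, phi, theta))
print(y[:5])
```

The autoregressive coefficients sum to 1, so the autoregressive polynomial has a root at 1 (a unit root), which is consistent with the non-stationarity reported by the test.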

We first generated the domain knowledge-representing feature subset $\{\boldsymbol{x}_t\} \setminus \{\boldsymbol{s}_t(y)\}$. For that, a binary classification feature set is designed with 26 fully informative features. The synthetically generated binary sequence $y_t^b$ has a class imbalance of $0.65$. The $y_t$ values obtained from (15) whose indexes correspond to label 1 in $y_t^b$ are multiplied by $1.33$ to upscale, while the others are multiplied by $0.66$ to downscale. This guarantees dependence on the generated domain knowledge series, which can provide a substantial amount of pattern. As the final step of generation, we added Gaussian noise of $N(0, 0.5)$. To simulate the curse of dimensionality problem, the total number of features is set to 36, as shown in Table 1, and most of the features are informative. The average p-value is also higher than $0.05$, which validates that our experiment settings are suitable for non-stationarity.
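
The label-dependent rescaling and the final noise step can be sketched as follows. This is an illustration under assumptions: the series `y` is a random placeholder for the scaled ARMA output, the value 0.65 is read as the share of label 1, and the 0.5 in $N(0, 0.5)$ is treated as the standard deviation; the text does not settle either reading.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
y = rng.random(n)  # placeholder for the min-max scaled series from (15)

# Binary sequence; 0.65 is interpreted as the probability of label 1 (assumption).
y_b = (rng.random(n) < 0.65).astype(int)

# Upscale where the label is 1, downscale otherwise.
scale = np.where(y_b == 1, 1.33, 0.66)
y_scaled = y * scale

# Additive Gaussian noise; 0.5 is used as the standard deviation (assumption).
y_noisy = y_scaled + rng.normal(0.0, 0.5, size=n)
print(y_noisy[:5])
```

The two multipliers match the limits of the $\alpha_t$ range used in the experiments, so a model that recovers the binary label can, in principle, undo the rescaling exactly.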

Figure 3: Comparison of the mean square error performances of Hierarchical Ensemble (black), Ensemble (blue), Base LightGBM (green), Embedded (red), Wrapper (purple) for the synthetic dataset.

With its high dimensionality and non-stationarity, the generated synthetic dataset is well suited to the problem statement in Section 3.

Based on the performance plot in Figure 3, our proposed method outperforms the other compared models while the overall loss trend is decreasing. We observe that the $y_t$-related features, given to $\boldsymbol{s}_t^{(1)}$, capture most of the patterns in the label while $\boldsymbol{s}_t^{(2)}$ finetunes the short-term patterns, as desired. Moreover, the descending trend indicates the robustness of our method. The Wrapper method is by far the worst-performing model based on the MSE scores since it fails to converge to a global optimum and ends up with a suboptimal feature subset. In Table 2, the much higher time consumption of this method compared to Hierarchical Ensemble (393.8 s versus 9.3552 s on average) also confirms the time efficiency of our model. Hence, Wrapper is the least efficient among the compared methods. Additionally, the Embedded and Ensemble models are very close in loss. One can say that the Embedded model generates long-term patterns successfully with $y_t$-related features. Therefore, Ensemble favors the $y_t$-related features over the date-related features. Even though the optimal feature subset is found by Embedded, it is outperformed by Hierarchical Ensemble since our method fully incorporates the information provided by codependent feature pairs rather than deciding whether to choose $y_t$-related or domain knowledge features. Moreover, Baseline LightGBM performs the closest to the proposed approach, which indicates the level of informativeness of the features. However, our proposed approach outperforms Baseline LightGBM since we overcome overfitting by training multiple models with fewer features, while the other overfits easily due to the large number of features.

Table 1: Statistics of the Datasets
| Dataset | Sample Size | Feature Size | Average p-value |
| --- | --- | --- | --- |
| Synthetic | 500 | 36 | 0.2376 |
| M4 Hourly Forecasting | 1001 | 24 | 0.36856 |

5.3 M4 Competition Datasets

Here, the hourly M4 dataset, which includes 414 different time series [31], is used as the real-life dataset. The M4 competition dataset does not include date-time indexes. Hence, we assign the indexes externally to the dataset to extract date-related features. The sample size of the train set was reduced to 953, while the test set includes 48 samples, as shown in Table 1. Applying the stationarity test to $y_t$, we found the p-value to be $0.36856$ on average, which indicates non-stationarity, as shown in Table 1. We give the $y_t$-related features to $\boldsymbol{s}_t^{(1)}$ to be consistent with the synthetic dataset. We transformed the desired data through a min-max scaler. In this setting, we took the $2^{nd}$, $4^{th}$, and $6^{th}$ lags of the desired data to generate the mean and standard deviation of the rolling window feature. The date-time features are included in $\boldsymbol{s}_t^{(2)}$. The date-related features are the cosine and sine vectors of the hour, day of the month, day of the week, month of the year, quarter of the year, and week of the year.
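
The cosine and sine date features can be sketched as below. This is a minimal illustration that assumes an arbitrary hourly start timestamp for the externally assigned index and encodes only three of the listed fields; the start date, the helper name `cyclic_encode`, and the zero-based field conventions are assumptions.

```python
import numpy as np
from datetime import datetime, timedelta

def cyclic_encode(values, period):
    """Map a periodic integer field to a (sine, cosine) pair on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

start = datetime(2020, 1, 1)  # arbitrary start of the external index (assumption)
stamps = [start + timedelta(hours=h) for h in range(1001)]

hour = [s.hour for s in stamps]               # 0..23
day_of_week = [s.weekday() for s in stamps]   # 0..6
month = [s.month - 1 for s in stamps]         # 0..11

features = {}
for name, values, period in [
    ("hour", hour, 24),
    ("day_of_week", day_of_week, 7),
    ("month", month, 12),
]:
    features[name + "_sin"], features[name + "_cos"] = cyclic_encode(values, period)

print(sorted(features))
```

Encoding each field as a sine and cosine pair keeps the end of a cycle adjacent to its beginning (for example, hour 23 and hour 0), which a raw integer field would not. The remaining fields (day of the month, quarter, week of the year) would follow the same pattern with their own periods.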

Based on the performance of the proposed method on the M4 hourly dataset in Figure 4, the Hierarchical Ensemble outperforms the other models. We highlight that the $y_t$-related features given at a higher level of the hierarchy form the long-term patterns, while the feature subset containing domain knowledge upscales or downscales these long-term patterns. Moreover, our method shows a more robust performance, with a stationary cumulative error, compared to the other methods, whose errors oscillate. Baseline LightGBM performs worse than our method because the high dimensionality causes overfitting. Although the feature size is smaller than in the synthetic experiment, the Baseline LightGBM memorizes the unique patterns in the train set due to the highly informative features.

Figure 4: Comparison of the mean square error performances of Hierarchical Ensemble (black), Ensemble (blue), Base LightGBM (green), Embedded (red), Wrapper (purple) for the M4 hourly competition dataset.

The Embedded and Ensemble methods perform close to each other since the Ensemble model generally chooses the Embedded model, giving $\alpha_t$ of 1 within the range $[0,1]$. It is important to note that Embedded utilizes a more dominant feature set compared to the date-related features, which indicates the inclination of Baseline LightGBM toward $y_t$-related features. Our proposed method overcomes this issue by finetuning with the less dominant features in the last level for a greater impact on $\tilde{y}_t^{(2)}$. Moreover, the Wrapper method performs the worst, which suggests that it converges to a non-optimal local minimum. It also shows the least robust MSE performance due to its oscillatory behavior, in contrast to our method. In Table 2, the higher time consumption of this method compared to Hierarchical Ensemble (75.24 s versus 17.46 s on average) also confirms the efficiency of our model. As our proposed method performs the best, we conclude that the date-related features modify the first layer predictions successfully, solving overshooting or undershooting problems. Overall, the proposed method works satisfactorily on the real-life dataset, outperforming the other models.

Table 2: The Comparison of Average Time Consumption in Seconds
| Dataset | Wrapper | Hierarchical Ensemble |
| --- | --- | --- |
| Synthetic | 393.8 | 9.3552 |
| M4 Hourly Forecasting | 75.24 | 17.46 |

6 Conclusions

In this work, we proposed an ensemble feature selection method based on hierarchical stacking. Building on traditional stacking methods, our approach leverages a hierarchical structure that fully exploits the co-dependency between features. This hierarchical stacking involves training an initial machine learning model using a subset of the features and then updating the output of the model using another machine learning algorithm that takes the remaining features, or a subset of them, to adjust the first layer predictions while minimizing a custom loss. This hierarchical structure provides novelty by allowing for flexible depth and suitability to any loss function. We demonstrate the effectiveness of our approach on the synthetic and M4 competition datasets. Overall, the proposed hierarchical ensemble approach for feature selection offers a robust and scalable solution to the challenges posed by datasets whose feature sets are high-dimensional relative to the sample size. By effectively capturing feature co-dependency, our method achieves enhanced accuracy and stability and outperforms the traditional and state-of-the-art machine learning models in our experiments. We also provide the source code of our approach to facilitate further research and replicability of our results.

Acknowledgements This work is in part supported by the Turkish Academy of Sciences Outstanding Researcher Program.

Author Contributions AT: Conceptualization, Methodology, Software, Investigation, Resources, Validation, Visualization, Writing- original draft, review. MEA: Conceptualization, Investigation, Resources, Validation, Writing- original draft, review. ATK: Conceptualization, Investigation, Supervision, Validation, Writing, review, editing. SSK: Conceptualization, Investigation, Methodology, Supervision, Validation, Writing, review, editing.

Funding Not applicable.

Data availability The synthetically generated data that support the findings of this study are available at https://github.com/aysintumay/hierarchical-feature-selection. The generation process of the synthetic dataset is explained in detail in the paper. The real-life data that support the findings of this study are openly available at https://www.kaggle.com/datasets/yogesh94/m4-forecasting-competition-dataset.

Code Availability The code of this work is publicly available on https://github.com/aysintumay/hierarchical-feature-selection.

Declarations

Competing Interests: The authors declare that they have no competing financial or non-financial interests that could have influenced the presented work in this paper.
Ethics Approval: The authors approve that there are no ethical concerns related to the work presented in this paper.
Consent to Participate: All authors agreed on the content and explicitly consented to the submission of the paper.
Consent for Publication: All authors who participated in this study give the publisher permission to publish this work.

References

\bibcommenthead
  • Bellman [\APACyear1957] \APACinsertmetastarbellman1957dynamic{APACrefauthors}Bellman, R.E.  \APACrefYear1957. \APACrefbtitleDynamic Programming Dynamic programming. \APACaddressPublisherPrinceton, NJPrinceton University Press. \PrintBackRefs\CurrentBib
  • Bolón-Canedo \BBA Alonso-Betanzos [\APACyear2019\APACexlab\BCnt1] \APACinsertmetastarBOLONCANEDO20191{APACrefauthors}Bolón-Canedo, V.\BCBT \BBA Alonso-Betanzos, A.  \APACrefYearMonthDay2019\BCnt1. \BBOQ\APACrefatitleEnsembles for feature selection: A review and future trends Ensembles for feature selection: A review and future trends.\BBCQ \APACjournalVolNumPagesInformation Fusion521-12, {APACrefDOI} https://doi.org/https://doi.org/10.1016/j.inffus.2018.11.008 {APACrefURL} https://www.sciencedirect.com/science/article/pii/S1566253518303440 \PrintBackRefs\CurrentBib
  • Bolón-Canedo \BBA Alonso-Betanzos [\APACyear2019\APACexlab\BCnt2] \APACinsertmetastarbolon2019ensembles{APACrefauthors}Bolón-Canedo, V.\BCBT \BBA Alonso-Betanzos, A.  \APACrefYearMonthDay2019\BCnt2. \BBOQ\APACrefatitleEnsembles for Feature Selection: A Review and Future Trends Ensembles for feature selection: A review and future trends.\BBCQ \APACjournalVolNumPagesInformation Fusion521–12, {APACrefDOI} https://doi.org/10.1016/j.inffus.2018.11.008 \PrintBackRefs\CurrentBib
  • Bolón-Canedo \BOthers. [\APACyear2014] \APACinsertmetastarbolon2014data{APACrefauthors}Bolón-Canedo, V., Sánchez-Maroño, N.\BCBL Alonso-Betanzos, A.  \APACrefYearMonthDay2014. \BBOQ\APACrefatitleData Classification Using an Ensemble of Filters Data classification using an ensemble of filters.\BBCQ \APACjournalVolNumPagesNeurocomputing13513–20, {APACrefDOI} https://doi.org/10.1016/j.neucom.2013.03.067 \PrintBackRefs\CurrentBib
  • Box \BBA Jenkins [\APACyear1970] \APACinsertmetastarbox1970time{APACrefauthors}Box, G.E.P.\BCBT \BBA Jenkins, G.M.  \APACrefYear1970. \APACrefbtitleTime Series Analysis: Forecasting and Control Time series analysis: Forecasting and control. \APACaddressPublisherSan FranciscoHolden-Day. \PrintBackRefs\CurrentBib
  • Breiman [\APACyear2001] \APACinsertmetastarbreiman2001random{APACrefauthors}Breiman, L.  \APACrefYearMonthDay2001. \BBOQ\APACrefatitleRandom Forests Random forests.\BBCQ \APACjournalVolNumPagesMachine Learning455–32, {APACrefDOI} https://doi.org/10.1023/A:1010933404324 \PrintBackRefs\CurrentBib
  • Cortes \BBA Vapnik [\APACyear1995] \APACinsertmetastarsupport_vector_networks{APACrefauthors}Cortes, C.\BCBT \BBA Vapnik, V.  \APACrefYearMonthDay1995. \BBOQ\APACrefatitleSupport Vector Networks Support vector networks.\BBCQ \APACjournalVolNumPagesMachine Learning20273-297, \PrintBackRefs\CurrentBib
  • Das [\APACyear2001] \APACinsertmetastardas2001filters{APACrefauthors}Das, S.  \APACrefYearMonthDay2001. \BBOQ\APACrefatitleFilters, Wrappers and a Boosting-Based Hybrid for Feature Selection Filters, wrappers and a boosting-based hybrid for feature selection.\BBCQ \APACrefbtitleProceedings of the International Conference on Machine Learning. Proceedings of the international conference on machine learning. \APACaddressPublisherUSA. \PrintBackRefs\CurrentBib
  • Dickey \BBA Fuller [\APACyear1979] \APACinsertmetastardickey1979distribution{APACrefauthors}Dickey, D.A.\BCBT \BBA Fuller, W.A.  \APACrefYearMonthDay1979. \BBOQ\APACrefatitleDistribution of the estimators for autoregressive time series with a unit root Distribution of the estimators for autoregressive time series with a unit root.\BBCQ \APACjournalVolNumPagesJournal of the American Statistical Association74366a427–431, \PrintBackRefs\CurrentBib
  • Du [\APACyear2019] \APACinsertmetastarml_models_favoring_yt_relateds{APACrefauthors}Du, M.  \APACrefYearMonthDay2019. \BBOQ\APACrefatitleImproving LSTM Neural Networks for Better Short-Term Wind Power Predictions Improving lstm neural networks for better short-term wind power predictions.\BBCQ \APACrefbtitle2019 IEEE 2nd International Conference on Renewable Energy and Power Engineering (REPE) 2019 ieee 2nd international conference on renewable energy and power engineering (repe) (\BPG 105-109). \PrintBackRefs\CurrentBib
  • Friedman [\APACyear1997] \APACinsertmetastarfriedman1997bias{APACrefauthors}Friedman, J.H.  \APACrefYearMonthDay1997. \BBOQ\APACrefatitleOn Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality On bias, variance, 0/1—loss, and the curse-of-dimensionality.\BBCQ \APACjournalVolNumPagesData Mining and Knowledge Discovery155–77, {APACrefDOI} https://doi.org/10.1023/A:1009778005914 \PrintBackRefs\CurrentBib
  • Fumagalli \BOthers. [\APACyear2023] \APACinsertmetastarFumagalli2022iPFI{APACrefauthors}Fumagalli, F., Muschalik, M., Hüllermeier, E.\BCBL Hammer, B.  \APACrefYearMonthDay2023. \BBOQ\APACrefatitleIncremental Permutation Feature Importance (iPFI): Towards Online Explanations on Data Streams Incremental permutation feature importance (ipfi): Towards online explanations on data streams.\BBCQ \APACjournalVolNumPagesMachine Learning, {APACrefURL} https://doi.org/10.1007/s10994-023-06385-y \PrintBackRefs\CurrentBib
  • Hancer [\APACyear2021] \APACinsertmetastarHancer2021ImprovedEvolutionary{APACrefauthors}Hancer, E.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleAn improved evolutionary wrapper-filter feature selection approach with a new initialisation scheme An improved evolutionary wrapper-filter feature selection approach with a new initialisation scheme.\BBCQ \APACjournalVolNumPagesMachine Learning, {APACrefDOI} https://doi.org/10.1007/s10994-021-05990-z {APACrefURL} https://www.mendeley.com/catalogue/53f9ff12-9a2d-3032-94d7-188d3887570d/ \PrintBackRefs\CurrentBib
  • Jenul \BOthers. [\APACyear2022] \APACinsertmetastarJenul2021UBayFS{APACrefauthors}Jenul, A., Schrunner, S., Pilz, J.\BCBL Tomic, O.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleA User-Guided Bayesian Framework for Ensemble Feature Selection in Life Science Applications (UBayFS) A user-guided bayesian framework for ensemble feature selection in life science applications (ubayfs).\BBCQ \APACjournalVolNumPagesMachine Learning, {APACrefURL} https://doi.org/10.1007/s10994-022-06221-9 \PrintBackRefs\CurrentBib
  • Ke \BOthers. [\APACyear2017] \APACinsertmetastarke2017lightgbm{APACrefauthors}Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W.\BDBLLiu, T\BHBIY.  \APACrefYearMonthDay2017. \BBOQ\APACrefatitleLightGBM: a highly efficient gradient boosting decision tree Lightgbm: a highly efficient gradient boosting decision tree.\BBCQ \APACrefbtitleProceedings of the 31st International Conference on Neural Information Processing Systems Proceedings of the 31st international conference on neural information processing systems (\BPG 3149–3157). \APACaddressPublisherRed Hook, NY, USACurran Associates Inc. \PrintBackRefs\CurrentBib
  • Kohavi \BBA John [\APACyear1997] \APACinsertmetastarkohavi1997wrappers{APACrefauthors}Kohavi, R.\BCBT \BBA John, G.H.  \APACrefYearMonthDay1997. \BBOQ\APACrefatitleWrappers for feature subset selection Wrappers for feature subset selection.\BBCQ \APACjournalVolNumPagesArtificial Intelligence971–2273–324, {APACrefDOI} https://doi.org/10.1016/s0004-3702(97)00043-x \PrintBackRefs\CurrentBib
  • Lee \BOthers. [\APACyear2020] \APACinsertmetastarwind{APACrefauthors}Lee, J., Wang, W., Harrou, F.\BCBL Sun, Y.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleWind Power Prediction Using Ensemble Learning-Based Models Wind power prediction using ensemble learning-based models.\BBCQ \APACjournalVolNumPagesIEEE Access861517-61527, {APACrefDOI} https://doi.org/10.1109/ACCESS.2020.2983234 \PrintBackRefs\CurrentBib
  • Lim, B., Arik, S.O., Loeff, N., & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4), 1748–1764. https://doi.org/10.1016/j.ijforecast.2021.03.012. URL: https://www.sciencedirect.com/science/article/pii/S0169207021000637
  • Lin, X., Li, C., Ren, W., Luo, X., & Qi, Y. (2019). A new feature selection method based on symmetrical uncertainty and interaction gain. Computational Biology and Chemistry, 83, 107149. https://doi.org/10.1016/j.compbiolchem.2019.107149. URL: https://www.sciencedirect.com/science/article/pii/S1476927118303736
  • Natekin, A., & Knoll, A. (2013, Dec.). Gradient Boosting Machines, a tutorial. Frontiers in Neurorobotics, 7. https://doi.org/10.3389/fnbot.2013.00021
  • Pearson, K. (1896). Mathematical Contributions to the Theory of Evolution. On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs. Proceedings of the Royal Society of London, 60, 489–498.
  • Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572. https://doi.org/10.1080/14786440109462720
  • Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. https://doi.org/10.1007/BF00116251
  • Saeys, Y., Abeel, T., & Van de Peer, Y. (2008). Robust Feature Selection Using Ensemble Feature Selection Techniques. In Machine Learning and Knowledge Discovery in Databases (pp. 313–325).
  • Saeys, Y., Inza, I., & Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507–2517. https://doi.org/10.1093/bioinformatics/btm344
  • Sanz, H., Valim, C., Vegas, E., Oller, J., & Reverter, F. (2018, November). SVM-RFE: Selection and visualization of the most relevant features through non-linear kernels. BMC Bioinformatics, 19. https://doi.org/10.1186/s12859-018-2451-4
  • Seijo-Pardo, B., Bolón-Canedo, V., & Alonso-Betanzos, A. (2019). On Developing an Automatic Threshold Applied to Feature Selection Ensembles. Information Fusion, 45, 227–245. https://doi.org/10.1016/j.inffus.2018.02.007
  • Tibshirani, R. (1996, January). Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
  • Urbanowicz, R.J., Meeker, M., La Cava, W., Olson, R.S., & Moore, J.H. (2018). Relief-based feature selection: Introduction and review. Journal of Biomedical Informatics, 85, 189–203. https://doi.org/10.1016/j.jbi.2018.07.014. URL: https://www.sciencedirect.com/science/article/pii/S1532046418301400
  • Verleysen, M., & François, D. (2005). The Curse of Dimensionality in Data Mining and Time Series Prediction. In J. Cabestany, A. Prieto, & F. Sandoval (Eds.), Computational Intelligence and Bioinspired Systems (pp. 758–770). Berlin, Heidelberg: Springer Berlin Heidelberg.
  • Yogesh, S. (2020). M4 Forecasting Competition Dataset. Kaggle. URL: https://www.kaggle.com/datasets/yogesh94/m4-forecasting-competition-dataset (Accessed on Apr. 1, 2023)
  • Yu, Q.R., Wang, R., Arik, S., & Dong, Y. (2023). Koopman Neural Forecaster for Time-series with Temporal Distribution Shifts. In Proceedings of ICLR.
  • Škrlj, B., Džeroski, S., Lavrač, N., & Petković, M. (2021). ReliefE: Feature Ranking in High-dimensional Spaces via Manifold Embeddings. Machine Learning. https://doi.org/10.1007/s10994-021-05998-5