[1] Mehmet Y. Turali (These authors contributed equally to this work.)

[1] Department of Electrical and Electronics Engineering, Bilkent University, Ankara, 06800, Turkey

AFS-BM: Enhancing Model Performance through Adaptive Feature Selection with Binary Masking

[email protected]    Mehmet E. Lorasdagi [email protected]    Suleyman S. Kozat [email protected] *
Abstract

We study the problem of feature selection in a general machine learning (ML) context, which is one of the most critical subjects in the field. Although many feature selection methods exist, they face challenges such as scalability, managing high-dimensional data, dealing with correlated features, adapting to variable feature importance, and integrating domain knowledge. To this end, we introduce “Adaptive Feature Selection with Binary Masking” (AFS-BM), which remedies these problems. AFS-BM achieves this through joint optimization for simultaneous feature selection and model training. In particular, we use joint optimization and binary masking to continuously adapt the set of features and the model parameters during the training process. This approach leads to significant improvements in model accuracy and a reduction in computational requirements. We provide an extensive set of experiments where we compare AFS-BM with established feature selection methods using well-known datasets from real-life competitions. Our results show that AFS-BM achieves a significant improvement in accuracy and requires significantly less computation. This is due to AFS-BM’s ability to dynamically adjust to the changing importance of features during the training process, which is an important contribution to the field. We openly share our code for the replicability of our results and to facilitate further research.

keywords:
Machine Learning, Feature Selection, Gradient Boosting Machines, Adaptive Optimization, Binary Mask, High-Dimensional Datasets

1 Introduction

The emergence of the digital era has enabled a significant increase in data across various domains, including genomics and finance. High-dimensional datasets are valuable in areas such as image processing, genomics, and finance due to the detailed information they provide [1]. These datasets can contribute to the development of highly accurate models. This influx of data, particularly in the form of high-dimensional feature sets, brings both opportunities and challenges for the machine learning domain [2]. High-dimensional data can reveal complex patterns, but it may also lead to the well-known “curse of dimensionality” [2]. The large number of features can sometimes hide true patterns, leading to models that require significant computational resources, are prone to overfitting, and are difficult to interpret [3].

While abundant features can provide rich information, not all features contribute equally to the final task. Some might be redundant, irrelevant, or even detrimental to the performance of the model if there is not enough data to learn from [4]. The challenge lies in identifying which features are essential and which can be discarded without compromising the accuracy of the model with limited data [5, 6]. This problem of feature selection is crucial for both regression and classification tasks. It is especially acute when the feature dimension is comparable to the training dataset size or in non-stationary environments, i.e., when we do not have enough data to learn which features are relevant due to changing statistics.

Here, we introduce “Adaptive Feature Selection with Binary Masking” (AFS-BM), a novel method designed for feature selection in high-dimensional datasets and scenarios characterized by non-stationarity. This method is particularly adept at handling cases with insufficient data for reliably identifying relevant features, a challenge it addresses through the integration of a binary mask. This binary mask, represented as a column vector with “1”s for active features and “0”s for inactive ones, plays a pivotal role in our joint optimization framework, which combines feature selection and model training, normally an NP-hard optimization problem. AFS-BM adapts the binary mask concurrently with the model parameter optimization. This joint and dynamic optimization enables us to fine-tune the feature set while training the model, resulting in a more agile and efficient identification of relevant features. Consequently, it significantly elevates the predictive accuracy of the underlying model. Our comprehensive experimentation across renowned competition datasets demonstrates the robustness and effectiveness of our method. Compared to widely used feature selection techniques, AFS-BM consistently yields substantial performance improvements. Furthermore, we openly share our code (https://github.com/YigitTurali/AFS_BM-Algorithm) to facilitate further research and replicability of our results.

Our main contributions to the literature can be summarized as follows:

  • We introduce AFS-BM that uniquely combines binary mask feature selection with joint optimization. This integration allows for dynamic feature selection, actively improving model accuracy by focusing on the most relevant features and simultaneously reducing noise by eliminating less significant ones.

  • This binary mask dynamically refines the feature set, playing a crucial role in improving the computational efficiency of the approach.

  • Our algorithm ensures a critical balance between feature selection and maintaining model accuracy by using a well-tuned and iterative approach that evaluates and prunes features based on their final impact on performance, thereby ensuring efficiency without compromising predictive precision.

  • We demonstrate significant performance improvements over traditional feature selection techniques, particularly when AFS-BM is applied to GBMs and NNs.

The rest of the paper is organized as follows: Section 2 outlines the current literature on filter, wrapper, embedded, and adaptive feature selection methods. Section 3 presents the mathematical background and problem description along with current feature selection methods. Section 4 introduces our novel feature selection structure, detailing the algorithms and their underlying principles. Section 5 showcases our experimental results, comparing our approach with established feature selection techniques. Finally, Section 6 offers a summary of our findings and conclusions.

2 Related Work

Several established methods for feature selection exist. Filter methods such as the Chi-squared test, mutual information [7], and correlation coefficients assess features based on their inherent statistical characteristics [8]. Wrapper methods, including techniques such as sequential forward selection and sequential backward elimination, employ specific machine learning algorithms to evaluate varying feature subsets [9]. Embedded methods, represented by techniques such as LASSO and decision tree-based approaches, integrate feature selection directly into the model training process [10, 11]. However, these methods often depend on static analysis and predetermined algorithms, which can be a limitation. Specifically, wrapper and embedded approaches may require significant computational resources, making them challenging to apply to large datasets [12]. While widely used, these methods often struggle with high-dimensional datasets and lack the flexibility to adapt to the evolving requirements of complex models [13]. In contrast, our method employs a binary mask coupled with iterative optimization, enabling dynamic adaptation to the learning patterns of a model. Since this approach jointly and dynamically adjusts to the changing importance of features during the training process, it enhances the precision of feature selection and ensures better performance in challenging scenarios involving high-dimensional or evolving datasets. Additionally, in the context of evolving datasets, where data patterns and relationships may shift over time, the flexibility of the binary mask to adapt to these changes ensures that the model remains robust and relevant. It can adjust to new patterns or discard previously relevant features that become obsolete, maintaining its efficacy in dynamically changing environments. This adaptability is crucial for maintaining high performance over time, especially in real-world applications where data is not static.

In recent years, there has been a shift towards adaptive feature selection methods. These methods focus on adapting to the learning behavior of the model. An example of this is Recursive Feature Elimination (RFE) for Support Vector Machines (SVMs), where features are prioritized based on their influence on the SVM’s decision-making [14, 15]. Tree-based models, such as Random Forests and GBMs, also provide feature importance metrics derived from their tree structures [16]. These existing methods, while adaptive, are not always optimal for high-dimensional and non-stationary data. They are also designed mainly for traditional machine learning models and might not be optimal for different architectures such as NNs [17]. In contrast, AFS-BM demonstrates superior adaptability and efficiency in handling such challenges. Its iterative refinement of feature selection through a binary mask offers more precise feature selection, making it a more versatile and robust solution: it continuously evaluates and adjusts which features are most predictive, thus improving the focus of the model on truly relevant data for NNs and other advanced machine learning algorithms.

3 Problem Statement

Vectors in this manuscript are represented as column vectors using bold lowercase notation, while matrices are denoted using bold uppercase letters. For a given vector $\bm{x}$ and a matrix $\bm{X}$, their respective transposes are represented as $\bm{x}^T$ and $\bm{X}^T$. The symbol $\odot$ denotes the Hadamard product, which signifies an element-wise multiplication operation between matrices. For any vector $\bm{x}$, $x_i$ represents the $i^{th}$ element. For a matrix $\bm{X}$, $X_{ij}$ indicates the element in the $i^{th}$ row and $j^{th}$ column. The operation $\sum(\cdot)$ calculates the sum of the elements of a vector or matrix. The L1 norm of a vector $\bm{x}$ is defined as $||\bm{x}||_1 = \sum_i |x_i|$, where the summation runs over all elements of the vector. The number of elements in set $S$ is given by $|S|$.

We address the feature selection problem in large datasets and explore the challenges associated with time series forecasting and classification tasks. We then examine GBMs and RFEs, concluding with a discussion of widely used feature selection methods in the machine learning literature, such as cross-correlation and mutual information.

We study adaptive feature selection in the context of online learning for prediction/forecasting/classification of non-stationary data, where we adaptively learn the most relevant features. We are given a vector sequence $\bm{x}_{1:T} = \{\bm{x}_t\}_{t=1}^{T}$, where $T$ denotes the sequence length and $\bm{x}_t \in \mathbb{R}^M$ is the feature vector at time $t$. This input can include features from the target sequence (endogenous) and auxiliary (exogenous) features such as weather or time of the day. The corresponding target output for $\bm{x}_{1:T}$ is $\bm{y}_{1:T} = \{\bm{y}_t\}_{t=1}^{T}$, with $\bm{y}_t \in \mathbb{R}^C$ being the desired output vector at time $t$, where $C$ represents the number of components or dimensions in the desired output vector.

In the online learning setting, the goal is to estimate $\bm{y}_t$ using only the inputs observed up to time $t$ as $\bm{\hat{y}}_t = f_t(\bm{y}_1, \ldots, \bm{y}_{t-1}, \bm{x}_1, \ldots, \bm{x}_t; \bm{\theta}_t)$, where $f_t$ is a dynamic nonlinear function parameterized by $\bm{\theta}_t$. After each observation of $\bm{y}_t$, we compute a loss $\mathcal{L}(\bm{y}_t, \bm{\hat{y}}_t)$ and update the algorithm parameters in real-time. As an example, in this paper, a loss such as the mean squared error (MSE) is computed over the sequence:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{T} \sum_{t=1}^{T} \bm{e}_t^T \bm{e}_t, \tag{1}$$

where $\bm{e}_t = \bm{y}_t - \bm{\hat{y}}_t$ denotes the error vector at time $t$. Other metrics, such as the mean absolute error (MAE), can also be considered since our method is generic and extends to other loss functions.
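As a minimal sketch of how (1) could be evaluated, assuming the targets and predictions are stored as NumPy arrays of shape $(T, C)$ (the function name `sequence_mse` is ours and purely illustrative):

```python
import numpy as np

def sequence_mse(y_true, y_pred):
    """Evaluate (1): the average over t of e_t^T e_t.

    y_true, y_pred: array-likes of shape (T, C), one row per time step.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred                            # e_t = y_t - y_hat_t
    per_step = np.einsum("tc,tc->t", errors, errors)    # e_t^T e_t for each t
    return per_step.mean()                              # (1/T) * sum over t

# Toy sequence with T = 2 steps and C = 2 output components.
loss = sequence_mse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 4.0]])
```

Note that (1) sums the squared errors over the $C$ output components and averages only over time, so for $C > 1$ it differs by a factor of $C$ from an MSE that also averages over components.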

Additionally, we address adaptive feature selection for classification tasks. We are given a set of feature vectors $\bm{X} = [\bm{x}_1, \ldots, \bm{x}_N]^T$, where $N$ is the total number of samples and $\bm{x}_i \in \mathbb{R}^M$ is the feature vector for the $i^{th}$ sample; this input can encompass both primary features and auxiliary information. For the dataset $\bm{X}$, the corresponding target labels are given by $\bm{Y} = [\bm{y}_1, \ldots, \bm{y}_N]^T$. Here, $\bm{y}_i \in \mathbb{R}^{|C|}$, where each class is denoted by $c$, which belongs to the set $C$, and $y_{i,c} = p$ represents the true probability of class $c$ for the $i^{th}$ sample.
The goal is to predict the class label $\hat{y}_i = \operatorname*{arg\,max}_c \hat{y}_{i,c}$ using the feature vector $\bm{x}_i$ as $\bm{\hat{y}}_i = f(\bm{x}_i; \bm{\theta})$, where $f$ is a nonlinear classifier function parameterized by $\bm{\theta}$. The corresponding predicted label probabilities are given by $\bm{\hat{Y}} = [\bm{\hat{y}}_1, \ldots, \bm{\hat{y}}_N]^T$, where $\bm{\hat{y}}_i \in [0,1]^{|C|}$. Here, $\bm{\hat{y}}_i$ represents the probability vector of all class labels for the $i^{th}$ sample, where each class is denoted by $c$ and belongs to the set $C$.
Upon observing the true probability vector of labels $\bm{y}_i$, we incur a loss $\mathcal{L}(\bm{y}_i, \bm{\hat{y}}_i)$ and adjust the model based on this loss. The performance of the model is assessed by classification accuracy or other relevant metrics. A commonly used metric is the cross-entropy loss $\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$, where $y_{i,c}$ is the true probability of class $c$ for the $i^{th}$ sample, and $\hat{y}_{i,c}$ is the predicted probability. Other metrics, such as the F1-score or Area Under the Curve (AUC), can also be used.
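The cross-entropy loss and the arg-max decision rule above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; the function names and the clipping constant `eps` (added only to avoid evaluating $\log(0)$) are our assumptions.

```python
import numpy as np

def cross_entropy(Y_true, Y_prob, eps=1e-12):
    """Average cross-entropy over N samples.

    Y_true, Y_prob: array-likes of shape (N, |C|); each row is a
    probability vector over the classes. `eps` is an assumed safeguard.
    """
    Y_true = np.asarray(Y_true, dtype=float)
    Y_prob = np.clip(np.asarray(Y_prob, dtype=float), eps, 1.0)
    per_sample = -np.sum(Y_true * np.log(Y_prob), axis=1)  # sum over classes c
    return per_sample.mean()                                # average over samples i

def predict_labels(Y_prob):
    """Arg-max decision rule: index of the most probable class per sample."""
    return np.argmax(np.asarray(Y_prob), axis=1)

# Two samples, two classes, one-hot true labels.
Y_true = [[1.0, 0.0], [0.0, 1.0]]
Y_prob = [[0.9, 0.1], [0.25, 0.75]]
ce = cross_entropy(Y_true, Y_prob)
labels = predict_labels(Y_prob)
```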

As the feature space expands, the risk of overfitting increases, especially when the number of features $M$ approaches the number of samples $N$ [2], often referred to as the “Curse of Dimensionality”. Non-stationarity adds a further difficulty. For example, in regression, it can arise from evolving relationships between variables [18]. In classification, it can manifest as shifting class boundaries due to changing class distributions [19].

Among the various feature selection techniques developed, some of the most effective and widely used methods include GBMs for feature importance [20], RFE, and Greedy Methods. These methods provide robust frameworks for identifying significant features within large datasets. Building upon these foundational techniques, we propose a novel method described in the following sections.

4 Adaptive Feature Selection with Binary Masking

Our method, AFS-BM, continually refines its choice of features by utilizing a binary mask. We first introduce the “Model Optimization Phase” (4.1), detailing how features are initially selected and utilized for model training. We then continue to the “Masked Optimization & Feature Selection Phase” (4.2), where we describe the process of refining the feature set and optimizing the binary mask. The combined approach ensures a systematic and adaptive feature selection for improved model performance.

Consider a dataset at the start of the algorithm, denoted as $\mathcal{D} = \{(\bm{x}_i, y_i)\}_{i=1}^{N+P}$, where $N+P$ is the number of samples in the dataset; the total is written as a sum because the dataset is later divided into two parts. Here, $\bm{x}_i \in \mathbb{R}^M$ represents the feature vector for sample $i$, and $y_i$ is the corresponding target value in the dataset $\mathcal{D}$. The set of feature vectors is defined as $\bm{X} = [\bm{x}_1, \ldots, \bm{x}_N]^T$ and the target vector is defined as $\bm{y} = [y_1, \ldots, y_N]^T$. To extract important features and eliminate redundant ones, we define a binary mask $\bm{z} \in \{0,1\}^M$, where $M$ is the total number of features.
The binary nature ensures that a feature is either selected (1) or not (0). The set of feature vectors $\bm{X}$ is modified by the mask through the Hadamard product:

$$\bm{X}_{modified} = \bm{X} \odot \bm{z}. \tag{2}$$
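As a small illustration of (2), the sketch below applies a mask to a toy feature matrix with NumPy. It assumes that the product of the $N \times M$ matrix $\bm{X}$ with the length-$M$ mask $\bm{z}$ is applied row by row (NumPy broadcasting); the values and the variable `X_reduced` are ours.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # N = 2 samples, M = 3 features
z = np.array([1, 0, 1])            # keep features 0 and 2, drop feature 1

# Equation (2): broadcasting multiplies every row of X element-wise by z,
# so the columns of inactive features become zero.
X_modified = X * z

# Equivalent view that physically removes the inactive columns instead.
X_reduced = X[:, z.astype(bool)]
```

Zeroing a column keeps the input dimension fixed, which is convenient when the model expects $M$ inputs, whereas dropping the column reduces the amount of data the model has to process.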

Moreover, the binary mask $\bm{z}$ can be seen as a constraint on the feature space. Therefore, the primary objective is to minimize the loss function while optimizing this mask, so that the model closely approximates the true target values while using as few features as possible. This problem can be represented as:

$$\min_{\bm{z},\bm{\theta}} \mathcal{L}(\bm{y}, F(\bm{X} \odot \bm{z}, \bm{\theta})) + \frac{||\bm{z}||_1}{M}, \tag{3}$$

subject to $\bm{z} \in \{0,1\}^M$. Note that (3) does not have a closed-form solution since it includes integer optimization [21]. Hence, we propose an iterative algorithm that overcomes the challenges posed by the integer optimization in (3). Our algorithm leverages subgradient methods [21] and introduces a relaxation technique for the binary constraints on $\bm{z}$. In each iteration, the algorithm refines the approximation of $\bm{z}$ and updates $\bm{\theta}$ based on the current estimate. The iterative process continues until convergence, i.e., when the change in the objective function value between consecutive iterations falls below a predefined threshold.
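To make the combinatorial nature of (3) concrete, the following sketch evaluates the objective for a fixed mask and then searches all $2^M$ masks exhaustively on a tiny synthetic problem. This is not the AFS-BM algorithm: the brute-force search is shown only to illustrate why enumeration is infeasible for large $M$, and the choices of ordinary least squares for $F$, the MSE for $\mathcal{L}$, and the synthetic data are our assumptions.

```python
import itertools
import numpy as np

def objective(X, y, z):
    """Value of (3) for a fixed mask z.

    Assumptions for illustration: F is an ordinary least-squares model
    fitted on the masked features, and the loss is the MSE.
    """
    M = X.shape[1]
    Xz = X * z                                        # masked features
    theta, *_ = np.linalg.lstsq(Xz, y, rcond=None)    # best theta for this mask
    mse = np.mean((y - Xz @ theta) ** 2)
    return mse + np.abs(z).sum() / M                  # loss + ||z||_1 / M

def exhaustive_search(X, y):
    """Enumerate all 2^M binary masks; feasible only for very small M."""
    M = X.shape[1]
    masks = (np.array(bits) for bits in itertools.product([0, 1], repeat=M))
    return min(masks, key=lambda z: objective(X, y, z))

# Synthetic data: only features 0 and 2 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]
best_mask = exhaustive_search(X, y)
```

With $M = 4$ the search visits only 16 masks, but the count doubles with every additional feature, which motivates the relaxed, iterative procedure described above.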

As a solution to the optimization problem presented in (3), our algorithm AFS-BM, detailed in Algorithm 1, iteratively performs both binary mask and loss optimization. The binary mask acts as a dynamic filter, allowing the algorithm to actively select or discard features during the learning process. This ensures that only the most relevant features are used, thereby improving the accuracy of the model. The iterative optimization refines both the parameters of the model and the binary mask simultaneously, ensuring that the feature selection process remains aligned with the learning objectives of the model. The algorithm continues its iterative process until the binary mask remains unchanged between iteration cycles, providing a robust and efficient solution to the problem.

The AFS-BM algorithm begins by initializing slack thresholds based on the cross-validation results, which are represented by the variables $\mu$ and $\beta$, as well as a positive real number $\Delta\mathcal{L}$, in line 1 of Algorithm 1. At the start of the algorithm, specifically at iteration $k=0$, two distinct datasets are used. The first dataset, denoted by $\mathcal{D}^{(0)}$, is used for model optimization and is defined as $\mathcal{D}^{(0)} = \{(\bm{x}_i, y_i)\}_{i=1}^{N}$. The second dataset, represented by $\mathcal{D}_m^{(0)}$, is employed for the masked feature selection process and is given by $\mathcal{D}_m^{(0)} = \{(\bm{x}_i^{(m)}, y_i^{(m)})\}_{i=1}^{P}$.
The definitions of both $\mathcal{D}^{(0)}$ and $\mathcal{D}_m^{(0)}$ are given in line 2. In the notation, the superscript in $\mathcal{D}^{(0)}$ and $\mathcal{D}_m^{(0)}$ indicates the iteration number, which in this case is the initial iteration, i.e., 0, and the subscript “$m$” in $\mathcal{D}_m$ signifies that this dataset is specifically used for “masked” feature selection. Here, $N$ and $P$ are the numbers of samples in the datasets. $\bm{x}_i, \bm{x}_i^{(m)} \in \mathbb{R}^M$ represent the feature vectors for sample $i$, and $y_i, y_i^{(m)}$ are the corresponding target values in the model optimization and masked feature selection datasets, respectively.
The sets of feature vectors are defined as $\bm{X}^{(0)} = [\bm{x}_1, \ldots, \bm{x}_N]^T$ and $\bm{X}_m^{(0)} = [\bm{x}_1^{(m)}, \ldots, \bm{x}_P^{(m)}]^T$ in line 3. The process begins with the initialization of the binary mask vector in line 4, $\bm{z}^{(0)} \in \{0,1\}^M$, with all entries set to one, indicating the inclusion of all features. Our goal is to train a model $F_k(\bm{x}_i, \bm{\theta})$ and optimize a binary mask $\bm{z}$ accordingly, so that the mask selects the best features at the $k$-th iteration.
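The initialization described above can be sketched as follows. This is a hedged illustration rather than the released implementation: the helper name `initial_split`, the random shuffling, and the toy data are our assumptions (for time series, a chronological split may be more appropriate).

```python
import numpy as np

def initial_split(X, y, P, seed=0):
    """Split N + P samples into a model-optimization set (N samples)
    and a masked feature-selection set (P samples).

    Random shuffling is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    mask_idx, model_idx = idx[:P], idx[P:]
    return (X[model_idx], y[model_idx]), (X[mask_idx], y[mask_idx])

# Toy dataset with N + P = 10 samples and M = 5 features.
X_all = np.arange(50, dtype=float).reshape(10, 5)
y_all = np.arange(10, dtype=float)

(X0, y0), (Xm0, ym0) = initial_split(X_all, y_all, P=3)   # D^(0) and D_m^(0)
z0 = np.ones(X_all.shape[1], dtype=int)                    # z^(0): all features active
```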

4.1 Model Optimization Phase of AFS-BM

In this phase, the focus is on using a binary mask from a prior iteration to identify the feature subset, which subsequently results in the formation of a masked dataset. The main objective is to fine-tune the model by reducing a given loss function while keeping the mask unchanged. After this optimization, a test loss is calculated and compared with a benchmark loss calculated on a separate masked dataset. This threshold enables the upcoming feature selection phase, ensuring the adaptive feature extraction.

The main loop of the algorithm runs as long as the stopping criterion $\beta \neq 0$ holds (line 5). Within this loop, before the $k$-th feature selection iteration, the feature subset is determined by the binary mask $\bm{z}^{(k-1)}$, which is optimal at iteration $k-1$ with respect to the minimization of the loss function from the previous iteration. In other words, $\bm{z}^{(k-1)}$ is the mask whose feature subset gave the lowest loss for the model during the $(k-1)$-th iteration. First, at line 6, the algorithm initializes a model $F_{k}(\bm{X}^{(k)}, \bm{\theta})$. Then, in line 7, the mask $\bm{z}^{(k-1)}$ is applied to each feature vector to generate $\bm{x}_{i}^{masked}$ for model training, $\bm{x}_{i}^{masked} = \bm{x}_{i} \odot \bm{z}^{(k-1)}$, which results in the following dataset in line 8,
$\mathcal{D}^{(k)} = \{(\bm{x}_{i}^{masked}, y_{i})\}_{i=1}^{N}$. Hence, $\bm{X}^{(k)} = [\bm{x}^{masked}_{1}, \ldots, \bm{x}^{masked}_{N}]^{T}$, where $\bm{x}^{masked}_{i} \in \mathbb{R}^{M}$ is the masked feature vector for the $i^{th}$ sample.
Training involves minimizing the loss function $\mathcal{L}(\bm{y}, F_{k}(\bm{X}^{(k)}, \bm{\theta}))$ with respect to the parameters of the model, $\bm{\theta}$, since the mask $\bm{z}^{(k-1)}$ is constant at this step, as can be seen in line 10:

\begin{equation}
\bm{\hat{\theta}}_{k} = \operatorname*{arg\,min}_{\bm{\theta}} \mathcal{L}(\bm{y}, F_{k}(\bm{X}^{(k)}, \bm{\theta})). \tag{4}
\end{equation}

After optimizing the model, we determine the test loss value using the current best parameters, denoted as $\bm{\hat{\theta}}_{k}$. This loss value is then defined as $\mathcal{L}_{th}$ and will serve as a reference point or threshold for the subsequent phase of the algorithm. To calculate $\mathcal{L}_{th}$, we use the masked feature selection dataset $\mathcal{D}_{m}$ as a test set for the model optimization phase of the AFS-BM algorithm. First, we mask the feature vector with the most recent optimal mask $\bm{z}^{(k-1)}$ from the previous iteration and generate ${\bm{x}_{i}^{(m)}}^{masked}$. Then, we update $\mathcal{D}_{m}^{(k-1)}$ to calculate $\mathcal{L}_{th}$ from line 11 to 14:

\begin{align}
{\bm{x}_{i}^{(m)}}^{masked} &= \bm{x}_{i}^{(m)} \odot \bm{z}^{(k-1)}, \tag{5} \\
\mathcal{D}_{m}^{(k-1)} &= \{({\bm{x}_{i}^{(m)}}^{masked}, y_{i}^{(m)})\}_{i=1}^{P}, \tag{6} \\
\bm{X}_{m}^{(k-1)} &= [{\bm{x}_{1}^{(m)}}^{masked}, \ldots, {\bm{x}_{P}^{(m)}}^{masked}]^{T}, \tag{7} \\
\mathcal{L}_{th} &= \mathcal{L}(\bm{y}^{(m)}, F_{k}(\bm{X}_{m}^{(k-1)}, \bm{\hat{\theta}}_{k})), \tag{8}
\end{align}

where we define a test set of feature vectors at step $(k-1)$ as $\bm{X}_{m}^{(k-1)} = [{\bm{x}_{1}^{(m)}}^{masked}, \ldots, {\bm{x}_{P}^{(m)}}^{masked}]^{T}$, and ${\bm{x}_{i}^{(m)}}^{masked} \in \mathbb{R}^{M}$ is the masked feature vector for the $i^{th}$ sample of the masked feature selection dataset. After calculating the threshold loss $\mathcal{L}_{th}$, we proceed with the masked feature selection phase.
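
The model optimization phase can be illustrated with a short sketch. It assumes NumPy, a mean squared error loss, and a plain linear model fitted by least squares as a stand-in for $F_{k}$; the stand-in model and the helper names `mse` and `model_optimization_phase` are hypothetical and only serve to make the masking and the threshold loss concrete.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, used here as an illustrative loss."""
    return float(np.mean((y_true - y_pred) ** 2))

def model_optimization_phase(X, y, X_m, y_m, z):
    """Fit on the masked training set, then score on the masked selection set."""
    X_k = X * z        # element-wise mask applied to every row (line 7)
    X_m_k = X_m * z    # same mask applied to the selection set, eq. (5)
    # Stand-in for eq. (4): least squares for the linear model F(X, theta) = X @ theta.
    theta_hat, *_ = np.linalg.lstsq(X_k, y, rcond=None)
    L_th = mse(y_m, X_m_k @ theta_hat)   # threshold loss, eq. (8)
    return theta_hat, L_th

rng = np.random.default_rng(0)
true_theta = np.array([2.0, 0.0, -1.0, 0.0])
X, X_m = rng.normal(size=(200, 4)), rng.normal(size=(100, 4))
y = X @ true_theta + 0.1 * rng.normal(size=200)
y_m = X_m @ true_theta + 0.1 * rng.normal(size=100)

theta_hat, L_th = model_optimization_phase(X, y, X_m, y_m, np.ones(4))
_, L_th_drop0 = model_optimization_phase(X, y, X_m, y_m, np.array([0, 1, 1, 1]))
```

In this toy setting, masking the informative first feature before training gives a much larger threshold loss than training with all features, which is the signal the next phase relies on.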

4.2 Masked Optimization & Feature Selection Phase of AFS-BM

In this phase, the AFS-BM algorithm undergoes a thorough feature extraction via mask optimization after training. It sets a slack variable to guide the optimization duration and uses a temporal mask to evaluate feature relevance. By iteratively masking out features and observing the impact on the loss of the model, the algorithm differentiates between essential and redundant features. This iterative process continues until the feature set stabilizes, ensuring the model is equipped with the most significant features for the minimum loss on the given dataset.

After training, the mask optimization and feature selection process starts. In this process, we use the slack variable $\mu \in \mathbb{Z}^{+}$ to stop the mask optimization. It is initialized in line 1, adjusted based on cross-validation, and only affects the computation time of the algorithm. The mask optimization phase starts by initializing a temporal mask $\bm{\hat{z}}^{(k-1)}$ that copies the binary mask from the previous iteration. We define the temporal mask $\bm{\hat{z}}^{(k-1)}$ at line 15 as:

\begin{equation}
\bm{\hat{z}}^{(k-1)} \leftarrow \bm{z}^{(k-1)}. \tag{9}
\end{equation}

As long as $\mu \neq 0$, which is checked at line 17, we continue with the index selection procedure. During this phase, the algorithm randomly selects a unique index $i$ from a set of available indices $\mathcal{S}$ in lines 18 to 19. This index selection procedure is described as follows:

Let $\mathcal{S} = \{1, \ldots, M\}$ be a set of indices with dimension $\dim{\mathcal{S}} = M$. An index $i$ is selected from $\mathcal{S}$ such that $i \in \mathcal{S}$. The selection of index $i$ from $\mathcal{S}$ is based on a uniform distribution. This probabilistic mechanism enhances the adaptiveness and diversity of the feature selection process. Once chosen, $i$ cannot be reselected, preserving the uniqueness of the selection. This is expressed as $i \in \mathcal{S}, \quad i \notin \{j \mid j \text{ has been previously selected}\}$.
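
Drawing indices uniformly without repetition is equivalent to walking through a uniformly random permutation of $\mathcal{S}$. The sketch below assumes NumPy and uses 0-based indices $\{0, \ldots, M-1\}$ instead of $\{1, \ldots, M\}$; the generator name `draw_indices` is hypothetical.

```python
import numpy as np

def draw_indices(M, seed=0):
    """Yield each index in {0, ..., M-1} exactly once, in uniformly random order."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(M):
        yield int(i)   # an index that has been yielded is never drawn again

order = list(draw_indices(6))
```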

Upon selecting index $i$, we set the corresponding element of the temporal mask, $\hat{z}_{i}^{(k-1)}$, to 0 at line 20: $\hat{z}_{i}^{(k-1)} \leftarrow 0$.

With the temporal mask $\bm{\hat{z}}^{(k-1)}$, we generate temporal feature vectors $\bm{\hat{x}}_{i}^{(m)^{masked}}$, a feature vector set $\bm{\hat{X}}_{m}^{(k)}$, and a dataset $\mathcal{\hat{D}}_{m}^{(k)}$ from lines 21 to 24:

\begin{align}
\bm{\hat{x}}_{i}^{(m)^{masked}} &= \bm{x}_{i}^{(m)} \odot \bm{\hat{z}}^{(k-1)}, \tag{10} \\
\mathcal{\hat{D}}_{m}^{(k)} &= \{(\bm{\hat{x}}_{i}^{(m)^{masked}}, y_{i}^{(m)})\}_{i=1}^{P}, \tag{11} \\
\bm{\hat{X}}_{m}^{(k)} &= [\bm{\hat{x}}_{1}^{(m)^{masked}}, \ldots, \bm{\hat{x}}_{P}^{(m)^{masked}}]^{T}, \tag{12} \\
\mathcal{L}_{mask} &= \mathcal{L}(\bm{y}^{(m)}, F_{k}(\bm{\hat{X}}_{m}^{(k)}, \bm{\hat{\theta}}_{k})), \tag{13}
\end{align}

and $\bm{\hat{X}}_{m}^{(k)} = \bm{X}_{m}^{(k-1)} \setminus \{i\}$, where $\bm{X}_{m}^{(k-1)} \setminus \{i\}$ denotes the feature set without feature $i$. We define $\mathcal{L}_{mask}$ as $\mathcal{L}_{mask} = \mathcal{L}(\bm{y}^{(m)}, F_{k}(\bm{X}_{m}^{(k-1)} \setminus \{i\}, \bm{\hat{\theta}}_{k}))$.
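
Evaluating a candidate removal amounts to zeroing one mask entry and re-scoring the already trained model on the selection set. The sketch below assumes NumPy, a model exposed through a `predict` callable, and a `loss` callable; these names, the helper `candidate_mask_loss`, and the fixed linear toy model are hypothetical.

```python
import numpy as np

def candidate_mask_loss(predict, loss, X_m, y_m, z, i):
    """Return the temporal mask with feature i switched off and the resulting loss.

    The model parameters are not refitted; only the input is masked.
    """
    z_hat = z.copy()          # temporal mask, a copy of the current mask
    z_hat[i] = 0              # tentatively remove feature i
    X_hat = X_m * z_hat       # masked selection set
    return z_hat, loss(y_m, predict(X_hat))

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

theta = np.array([2.0, 0.0, -1.0])      # toy trained parameters; feature 1 is unused

def predict(X):
    return X @ theta

rng = np.random.default_rng(0)
X_m = rng.normal(size=(50, 3))
y_m = X_m @ theta                        # noiseless targets for the toy example
z = np.ones(3, dtype=int)

z_hat_1, loss_1 = candidate_mask_loss(predict, mse, X_m, y_m, z, 1)
z_hat_0, loss_0 = candidate_mask_loss(predict, mse, X_m, y_m, z, 0)
```

Here removing the unused feature leaves the loss essentially at zero, while removing the first feature raises it sharply.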

Using the temporal feature vectors generated with the updated temporal mask, the algorithm calculates a new loss $\mathcal{L}_{mask}$ in line 25. Moreover, we say that a feature $i$ is irrelevant if and only if

\[
\frac{\mathcal{L}(\bm{y}^{(m)}, F_{k}(\bm{X}_{m}^{(k-1)} \setminus \{i\}, \bm{\hat{\theta}}_{k})) - \mathcal{L}(\bm{y}^{(m)}, F_{k}(\bm{X}_{m}^{(k-1)}, \bm{\hat{\theta}}_{k}))}{\mathcal{L}(\bm{y}^{(m)}, F_{k}(\bm{X}_{m}^{(k-1)}, \bm{\hat{\theta}}_{k}))} \leq \Delta\mathcal{L}
\]

or $\frac{\mathcal{L}_{mask} - \mathcal{L}_{th}}{\mathcal{L}_{th}} \leq \Delta\mathcal{L}$, where $\bm{X}_{m}^{(k-1)} \setminus \{i\}$ denotes the feature set without feature $i$ and $\Delta\mathcal{L}$ denotes a predefined threshold, which is also initialized at line 1, adjusted by cross-validation, and only affects the sensitivity with which a feature's relevance is determined. The masked state of feature $i$ is retained if the feature is irrelevant, i.e., if removing it leaves the loss unchanged or changes it by no more than the predefined threshold $\Delta\mathcal{L}$; otherwise, the feature is restored:

\[
\hat{z}^{(k-1)}_{i} = \begin{cases} 0, & \text{if } \frac{\mathcal{L}_{mask} - \mathcal{L}_{th}}{\mathcal{L}_{th}} \leq \Delta\mathcal{L} \\ 1, & \text{otherwise}. \end{cases}
\]

If the relative change in the loss does not exceed the threshold, we update $\mathcal{L}_{th}$ to $\mathcal{L}_{mask}$. Otherwise, we restore $\hat{z}^{(k-1)}_{i}$ to 1, decrement $\mu$, and move to the next randomly chosen index. Let $\delta\mathcal{L}$ denote the relative change in the loss function when feature $i$ is removed, i.e., when $\hat{z}^{(k-1)}_{i}$ is set to 0. Formally, $\delta\mathcal{L} = \frac{\mathcal{L}_{mask} - \mathcal{L}_{th}}{\mathcal{L}_{th}}$. Given the threshold $\Delta\mathcal{L}$, a feature $i$ is deemed irrelevant if $\delta\mathcal{L} \leq \Delta\mathcal{L}$. This implies that the performance of the model does not degrade by more than $\Delta\mathcal{L}$ when feature $i$ is removed. Therefore, for any feature $i$ for which $\delta\mathcal{L} \leq \Delta\mathcal{L}$, the binary mask keeps $\hat{z}^{(k-1)}_{i} = 0$, ensuring that only features that significantly contribute to the performance of the model are retained. This process is shown in lines 26 to 33.
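
One possible reading of this mask optimization loop is sketched below, assuming NumPy, `predict` and `loss` callables for the trained model, and a strictly positive $\mathcal{L}_{th}$. The function name `optimize_mask`, drawing candidates only among currently active features, and stopping when no candidates remain are assumptions of this sketch rather than details stated in the text.

```python
import numpy as np

def optimize_mask(predict, loss, X_m, y_m, z, L_th, delta_L, mu, seed=0):
    """Greedy mask optimization with a relative-loss threshold (assumes L_th > 0)."""
    rng = np.random.default_rng(seed)
    z_hat = z.copy()                                   # temporal mask
    # Uniformly random order over the active features; each is tried at most once.
    candidates = [int(i) for i in rng.permutation(np.flatnonzero(z_hat))]
    while mu != 0 and candidates:
        i = candidates.pop()
        z_hat[i] = 0                                   # tentatively remove feature i
        L_mask = loss(y_m, predict(X_m * z_hat))
        if (L_mask - L_th) / L_th <= delta_L:
            L_th = L_mask                              # irrelevant: keep it masked
        else:
            z_hat[i] = 1                               # relevant: restore it
            mu -= 1                                    # spend one unit of slack
    return z_hat, L_th

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

theta = np.array([2.0, 0.0, -1.0, 0.0])                # features 1 and 3 carry no signal

def predict(X):
    return X @ theta

rng = np.random.default_rng(0)
X_m = rng.normal(size=(200, 4))
y_m = X_m @ theta + 0.1 * rng.normal(size=200)
z = np.ones(4, dtype=int)
L_th = mse(y_m, predict(X_m))

z_hat, L_th_new = optimize_mask(predict, mse, X_m, y_m, z, L_th, delta_L=0.05, mu=4)
```

With `mu=4` every feature is tried in this toy example; a smaller `mu` would end the search after fewer rejected removals and could leave some irrelevant features untested.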
The mask optimization concludes when $\mu = 0$. Once the mask optimization phase concludes, at line 34, we employ the optimized mask to eliminate redundant features:

\begin{align}
\textbf{DeleteColumns}(\bm{X}^{(k)}, \bm{X}_{m}^{(k-1)}, \bm{\hat{z}}^{(k-1)}) &= \tag{14} \\
&\bm{X}^{(k+1)}, \bm{X}_{m}^{(k)}, \bm{z}^{(k)}. \tag{15}
\end{align}

The function $\textbf{DeleteColumns}(\cdot,\cdot,\cdot)$ removes feature columns from the datasets $\bm{X}^{(k)}$ and $\bm{X}_{m}^{(k-1)}$, and the corresponding entries from $\bm{z}^{(k-1)}$, based on the temporal binary mask $\bm{\hat{z}}^{(k-1)}$. The mask $\bm{z}$ is stable if

\[
\bm{z}^{(k)} = \bm{z}^{(k-1)}
\]

for a predefined number of consecutive iterations $\beta$, which is again adjusted by cross-validation and only affects the computation time of the algorithm. If the mask remains unchanged for $\beta$ consecutive iterations, the algorithm determines that it has reached an optimal set of features and terminates; otherwise, it proceeds to the next, $(k+1)$-th, model optimization (lines 35 to 38).
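
The column deletion and the stability check can be sketched as follows, assuming NumPy arrays with one column per feature. The helper names `delete_columns` and `mask_is_stable` are hypothetical, and treating "stable" as "no entry was removed" is an interpretation of the condition above.

```python
import numpy as np

def delete_columns(X, X_m, z_hat):
    """Drop the columns that the temporal mask marks with 0 from both datasets
    and return the shortened mask, which then contains only ones."""
    keep = z_hat.astype(bool)
    return X[:, keep], X_m[:, keep], z_hat[keep]

def mask_is_stable(z_prev, z_new):
    """True when the mask did not change between two consecutive iterations."""
    return z_prev.shape == z_new.shape and bool(np.all(z_prev == z_new))

X = np.arange(20.0).reshape(5, 4)
X_m = np.arange(12.0).reshape(3, 4)
z = np.ones(4, dtype=int)
z_hat = np.array([1, 0, 1, 0])

X_next, X_m_next, z_next = delete_columns(X, X_m, z_hat)
changed = not mask_is_stable(z, z_next)

# A second pass in which nothing is removed leaves the mask unchanged.
_, _, z_again = delete_columns(X_next, X_m_next, z_next)
stable = mask_is_stable(z_next, z_again)
```

A counter of consecutive stable iterations, compared against $\beta$, would then decide when to stop the outer loop.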

The complete description of the algorithm can be found in Algorithm 1.

Algorithm 1 Adaptive Feature Selection with Binary Masking (AFS-BM)
1: Initialize $\mu, \beta \in \mathbb{Z}^{+}$; $\Delta\mathcal{L} \in \mathbb{R}^{+}$
2: $\mathcal{D}^{(0)} = \{(\bm{x}_{i}, y_{i})\}_{i=1}^{N}$, $\mathcal{D}_{m}^{(0)} = \{(\bm{x}_{i}^{(m)}, y_{i}^{(m)})\}_{i=1}^{P}$
3: $\bm{X}^{(0)} = [\bm{x}_{1}, \ldots, \bm{x}_{N}]^{T}$, $\bm{X}_{m}^{(0)} = [\bm{x}_{1}^{(m)}, \ldots, \bm{x}_{P}^{(m)}]^{T}$
4: $\bm{z}^{(0)} \in \{0,1\}^{M}$
5: while stop criterion $\beta \neq 0$ do
6:     Initialize $F_{k}(\bm{X}^{(k)}, \bm{\theta})$
7:     $\bm{x}_{i}^{masked} = \bm{x}_{i} \odot \bm{z}^{(k-1)}$
8:     $\mathcal{D}^{(k)} = \{(\bm{x}_{i}^{masked}, y_{i})\}_{i=1}^{N}$
9:     $\bm{X}^{(k)} = [\bm{x}^{masked}_{1}, \ldots, \bm{x}^{masked}_{N}]^{T}$
10:     $\bm{\hat{\theta}}_{k} = \operatorname*{arg\,min}_{\bm{\theta}} \mathcal{L}(\bm{y}, F_{k}(\bm{X}^{(k)}, \bm{\theta}))$
11:     ${\bm{x}_{i}^{(m)}}^{masked} = \bm{x}_{i}^{(m)} \odot \bm{z}^{(k-1)}$
12:    𝒟m(k1)={(𝒙i(m)masked,yi(m))}i=1Psuperscriptsubscript𝒟𝑚𝑘1superscriptsubscriptsuperscriptsuperscriptsubscript𝒙𝑖𝑚𝑚𝑎𝑠𝑘𝑒𝑑superscriptsubscript𝑦𝑖𝑚𝑖1𝑃\mathcal{D}_{m}^{(k-1)}=\{({\bm{x}_{i}^{(m)}}^{masked},y_{i}^{(m)})\}_{i=1}^{P}caligraphic_D start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT = { ( bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT start_POSTSUPERSCRIPT italic_m italic_a italic_s italic_k italic_e italic_d end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT
13:    𝑿m(k1)=[𝒙1(m),,𝒙P(m)]Tsuperscriptsubscript𝑿𝑚𝑘1superscriptsuperscriptsubscript𝒙1𝑚superscriptsubscript𝒙𝑃𝑚𝑇\bm{X}_{m}^{(k-1)}=[\bm{x}_{1}^{(m)},\ldots,\bm{x}_{P}^{(m)}]^{T}bold_italic_X start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT = [ bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT
14:    th=(𝒚(m),Fk(𝑿m(k1),𝜽^k))subscript𝑡superscript𝒚𝑚subscript𝐹𝑘superscriptsubscript𝑿𝑚𝑘1subscriptbold-^𝜽𝑘\mathcal{L}_{th}=\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{X}_{m}^{(k-1)},\bm{\hat{% \theta}}_{k}))caligraphic_L start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT = caligraphic_L ( bold_italic_y start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT , italic_F start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_X start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT , overbold_^ start_ARG bold_italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) )
15:    Define 𝒛^(k1)𝒛(k1)superscriptbold-^𝒛𝑘1superscript𝒛𝑘1\bm{\hat{z}}^{(k-1)}\leftarrow\bm{z}^{(k-1)}overbold_^ start_ARG bold_italic_z end_ARG start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT ← bold_italic_z start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT
16:    Initialize set of available indices: 𝒮𝒮\mathcal{S}caligraphic_S
17:    while stop criterion μ0𝜇0\mu\neq 0italic_μ ≠ 0 do
18:         Randomly select i𝑖iitalic_i from 𝒮𝒮\mathcal{S}caligraphic_S
19:         Remove i𝑖iitalic_i from 𝒮𝒮\mathcal{S}caligraphic_S
20:         z^i(k1)=0superscriptsubscript^𝑧𝑖𝑘10\hat{z}_{i}^{(k-1)}=0over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT = 0
21:         𝒙^i(m)masked=𝒙i(m)𝒛^(k1)superscriptsubscriptbold-^𝒙𝑖superscript𝑚𝑚𝑎𝑠𝑘𝑒𝑑direct-productsuperscriptsubscript𝒙𝑖𝑚superscriptbold-^𝒛𝑘1{\bm{\hat{x}}_{i}^{(m)^{masked}}}=\bm{x}_{i}^{(m)}\odot\bm{\hat{z}}^{(k-1)}overbold_^ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) start_POSTSUPERSCRIPT italic_m italic_a italic_s italic_k italic_e italic_d end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT = bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ⊙ overbold_^ start_ARG bold_italic_z end_ARG start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT
22:         𝒟^m(k)={(𝒙^i(m)masked,yi(m))}i=1Psuperscriptsubscript^𝒟𝑚𝑘superscriptsubscriptsuperscriptsubscriptbold-^𝒙𝑖superscript𝑚𝑚𝑎𝑠𝑘𝑒𝑑superscriptsubscript𝑦𝑖𝑚𝑖1𝑃\mathcal{\hat{D}}_{m}^{(k)}=\{({\bm{\hat{x}}_{i}^{(m)^{masked}}},y_{i}^{(m)})% \}_{i=1}^{P}over^ start_ARG caligraphic_D end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = { ( overbold_^ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) start_POSTSUPERSCRIPT italic_m italic_a italic_s italic_k italic_e italic_d end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT
23:         𝑿^m(t)=[𝒙^1(m)masked,,𝒙^P(m)masked]Tsuperscriptsubscriptbold-^𝑿𝑚𝑡superscriptsuperscriptsubscriptbold-^𝒙1superscript𝑚𝑚𝑎𝑠𝑘𝑒𝑑superscriptsubscriptbold-^𝒙𝑃superscript𝑚𝑚𝑎𝑠𝑘𝑒𝑑𝑇\bm{\hat{X}}_{m}^{(t)}=[\bm{\hat{x}}_{1}^{(m)^{masked}},\ldots,\bm{\hat{x}}_{P% }^{(m)^{masked}}]^{T}overbold_^ start_ARG bold_italic_X end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT = [ overbold_^ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) start_POSTSUPERSCRIPT italic_m italic_a italic_s italic_k italic_e italic_d end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , … , overbold_^ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) start_POSTSUPERSCRIPT italic_m italic_a italic_s italic_k italic_e italic_d end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT
24:         𝑿m(k1){i}=𝑿^m(k)superscriptsubscript𝑿𝑚𝑘1𝑖superscriptsubscriptbold-^𝑿𝑚𝑘\bm{X}_{m}^{(k-1)}\setminus\{i\}=\bm{\hat{X}}_{m}^{(k)}bold_italic_X start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT ∖ { italic_i } = overbold_^ start_ARG bold_italic_X end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT
25:         mask=(𝒚(m),Fk(𝑿m(t1){i},𝜽^t))subscript𝑚𝑎𝑠𝑘superscript𝒚𝑚subscript𝐹𝑘superscriptsubscript𝑿𝑚𝑡1𝑖subscriptbold-^𝜽𝑡\mathcal{L}_{mask}=\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{X}_{m}^{(t-1)}\setminus% \{i\},\bm{\hat{\theta}}_{t}))caligraphic_L start_POSTSUBSCRIPT italic_m italic_a italic_s italic_k end_POSTSUBSCRIPT = caligraphic_L ( bold_italic_y start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT , italic_F start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_X start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT ∖ { italic_i } , overbold_^ start_ARG bold_italic_θ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) )
26:         if maskbestbestΔsubscript𝑚𝑎𝑠𝑘subscript𝑏𝑒𝑠𝑡subscript𝑏𝑒𝑠𝑡Δ\frac{\mathcal{L}_{mask}-\mathcal{L}_{best}}{\mathcal{L}_{best}}\leq\Delta% \mathcal{L}divide start_ARG caligraphic_L start_POSTSUBSCRIPT italic_m italic_a italic_s italic_k end_POSTSUBSCRIPT - caligraphic_L start_POSTSUBSCRIPT italic_b italic_e italic_s italic_t end_POSTSUBSCRIPT end_ARG start_ARG caligraphic_L start_POSTSUBSCRIPT italic_b italic_e italic_s italic_t end_POSTSUBSCRIPT end_ARG ≤ roman_Δ caligraphic_L then
27:             z^i(k1)0superscriptsubscript^𝑧𝑖𝑘10\hat{z}_{i}^{(k-1)}\leftarrow 0over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT ← 0
28:             thmasksubscript𝑡subscript𝑚𝑎𝑠𝑘\mathcal{L}_{th}\leftarrow\mathcal{L}_{mask}caligraphic_L start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT ← caligraphic_L start_POSTSUBSCRIPT italic_m italic_a italic_s italic_k end_POSTSUBSCRIPT
29:         else
30:             z^i(k1)1superscriptsubscript^𝑧𝑖𝑘11\hat{z}_{i}^{(k-1)}\leftarrow 1over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT ← 1
31:             μμ1𝜇𝜇1\mu\leftarrow\mu-1italic_μ ← italic_μ - 1
32:         end if
33:    end while
34:    DeleteColumns(𝑿(k),𝑿m(k1),𝒛^(k1))=𝑿(k+1),𝑿m(k),𝒛(k)DeleteColumnssuperscript𝑿𝑘superscriptsubscript𝑿𝑚𝑘1superscriptbold-^𝒛𝑘1superscript𝑿𝑘1superscriptsubscript𝑿𝑚𝑘superscript𝒛𝑘\textbf{DeleteColumns}(\bm{X}^{(k)},\bm{X}_{m}^{(k-1)},\bm{\hat{z}}^{(k-1)})=% \bm{X}^{(k+1)},\bm{X}_{m}^{(k)},\bm{z}^{(k)}DeleteColumns ( bold_italic_X start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT , overbold_^ start_ARG bold_italic_z end_ARG start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT ) = bold_italic_X start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT , bold_italic_z start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT
35:    if 𝒛(k1)=𝒛(k)superscript𝒛𝑘1superscript𝒛𝑘\bm{z}^{(k-1)}=\bm{z}^{(k)}bold_italic_z start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT = bold_italic_z start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT then
36:         ββ1𝛽𝛽1\beta\leftarrow\beta-1italic_β ← italic_β - 1
37:    end if
38:end while
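
The masking loop of the algorithm above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: a least-squares linear model stands in for the GBM or MLP learner, the inner counter $\mu$ is assumed to be reset at every outer iteration, $\mathcal{L}_{best}$ in the acceptance test is interpreted as the running threshold $\mathcal{L}_{th}$, and the names `afs_bm`, `fit_linear`, and `mse` are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in learner (assumption): least-squares weights instead of a GBM/MLP."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def afs_bm(X, y, X_m, y_m, beta=2, mu=3, delta=0.0, seed=0):
    """Return a binary mask over the columns of X.

    X, y     : training data
    X_m, y_m : mask-validation data
    beta, mu : patience counters of the outer and inner loops
    delta    : tolerated relative loss increase (Delta L in the algorithm)
    """
    rng = np.random.default_rng(seed)
    z = np.ones(X.shape[1], dtype=int)           # start with every feature active
    while beta != 0:
        w = fit_linear(X * z, y)                 # train on the masked features
        loss_th = mse(y_m, (X_m * z) @ w)        # reference loss on mask validation
        z_hat = z.copy()
        available = list(np.flatnonzero(z))      # indices still eligible for removal
        patience = mu                            # assumption: mu is reset each round
        while patience != 0 and available:
            i = available.pop(int(rng.integers(len(available))))
            z_hat[i] = 0                         # tentatively mask feature i
            loss_mask = mse(y_m, (X_m * z_hat) @ w)
            if (loss_mask - loss_th) / max(loss_th, 1e-12) <= delta:
                loss_th = loss_mask              # accept the removal
            else:
                z_hat[i] = 1                     # restore the feature
                patience -= 1
        if np.array_equal(z_hat, z):             # mask unchanged in this round
            beta -= 1
        z = z_hat
    return z
```

Zeroing a column has the same effect on the stand-in linear model as deleting it, so the sketch keeps the matrices at full width instead of calling a separate column-deletion step.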

5 Simulations

In this study, we compare the AFS-BM algorithm with widely used feature selection approaches on well-established real-life competition datasets. The efficacy of AFS-BM is demonstrated on regression tasks, with Gradient Boosting Machines (GBMs) and Neural Networks (NNs) serving as the underlying models. Note that our algorithm is generic and can be applied to other machine learning algorithms; however, we concentrate on these two since they are widely used in the literature.

5.1 Implementation and Evaluation of AFS-BM

Our experimental framework covers regression tasks on both a synthetic dataset and well-known competition datasets. We first evaluate the performance of the AFS-BM algorithm against the other feature selection algorithms on the synthetic dataset. Starting with a synthetic dataset provides a controlled environment in which the ground truth is known, allowing us to thoroughly evaluate the algorithm's behavior and effectiveness. This initial step provides valuable insights into the algorithm's capabilities before applying it to real-world datasets. We then continue with the real-life datasets, namely the M4 Forecasting Competition Dataset [22] and the Istanbul Stock Exchange Hourly Dataset [23]. A crucial aspect of our study is performance benchmarking, where we compare the results of our AFS-BM feature selection algorithm against other prevalent methods in the field. Using a validation set, we optimize all hyperparameters for LightGBM, XGBoost, MLP, and the feature selection methods to remove any spurious or coincidental correlations that might arise from overfitting. The hyperparameter search space is also cross-validated.

5.1.1 Regression Experiments and Results

Our regression experiments use the daily series of the M4 Forecasting Competition Dataset [22] and the Istanbul Stock Exchange Hourly Dataset [23]. We will first configure the AFS-BM algorithm for each dataset and then compare its performance against traditional feature selection methods. The experiments will also explore the impact of hyperparameter tuning on AFS-BM’s accuracy, concluding with insights into its overall effectiveness.

Time Series Feature Engineering and Algorithmic Evaluation on the Daily M4 Forecasting Dataset

To predict the target value $y_t$, we utilize the daily series from the M4 Forecasting Competition Dataset, which initially contains only the target values $y_t$. For feature engineering, we consider the first three lags of $y_t$, namely $y_{t-1}$, $y_{t-2}$, and $y_{t-3}$, and the 7th, 14th, and 28th lags, $y_{t-7}$, $y_{t-14}$, and $y_{t-28}$, respectively. Beyond these basic lag features, we compute rolling statistics, specifically the rolling mean and rolling standard deviation with window sizes of 4, 7, and 28 until $y_t$.
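
The lag and rolling-window features described above can be sketched with pandas as follows. This is an illustrative sketch, not the authors' pipeline: the function name `make_daily_features` and the column names are hypothetical, the rolling windows are assumed to end at $y_{t-1}$ so that the target itself is never used as an input, and the standard deviation is the sample standard deviation that pandas computes by default.

```python
import pandas as pd

def make_daily_features(y: pd.Series) -> pd.DataFrame:
    """Build lag and rolling-statistic features for a daily series y."""
    feats = pd.DataFrame({"y": y})                 # target column
    for lag in (1, 2, 3, 7, 14, 28):               # lags listed in the text
        feats[f"lag_{lag}"] = y.shift(lag)
    past = y.shift(1)                              # assumption: exclude y_t itself
    for window in (4, 7, 28):                      # window sizes listed in the text
        feats[f"roll_mean_{window}"] = past.rolling(window).mean()
        feats[f"roll_std_{window}"] = past.rolling(window).std()
    return feats.dropna()                          # drop rows with incomplete history
```

With these settings the first 28 rows of each series are dropped, since the 28th lag and the 28-sample window are undefined there.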

To remove any bias toward a particular time series, we select 100 random series from the dataset and normalize them to $[-1,1]$ after mean removal. Since the data length is not uniform across series, for each series we reserve the last 10% of the total samples for testing, the 20% of the total samples immediately before the test set for mask validation, and the 20% of the total samples immediately before the mask validation set for model validation. To underscore the significance of feature selection, we also evaluate LightGBM, XGBoost, and the Multi-Layer Perceptron (MLP) in their default configurations without any feature selection technique. We compare our algorithm against the well-known feature selection methods Cross-Correlation, Mutual Information, and RFE. Using a validation set, we optimize all hyperparameters for LightGBM, XGBoost, MLP, and the feature selection methods. Both our introduced AFS-BM algorithm and the other feature selection methods employ LightGBM and XGBoost as their primary boosting techniques.
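
The chronological split described above can be sketched as follows. The function name `chronological_split` and its return format are hypothetical, and rounding each portion down with the remainder going to training is an assumption, since the text does not say how fractional sample counts are handled.

```python
def chronological_split(n_samples, test=0.10, mask_val=0.20, model_val=0.20):
    """Split indices 0..n_samples-1 in time order: train, model_val, mask_val, test."""
    n_test = int(n_samples * test)
    n_mask = int(n_samples * mask_val)
    n_model = int(n_samples * model_val)
    n_train = n_samples - n_test - n_mask - n_model   # remainder goes to training
    bounds = [0, n_train, n_train + n_model, n_train + n_model + n_mask, n_samples]
    names = ["train", "model_val", "mask_val", "test"]
    return {name: range(bounds[j], bounds[j + 1]) for j, name in enumerate(names)}
```

For a series of 1000 samples this gives indices 0 to 499 for training, 500 to 699 for model validation, 700 to 899 for mask validation, and 900 to 999 for testing.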

For each randomly selected time series, denoted as $y_t^{(s)}$, where $s=1,\ldots,100$, we apply all the feature selection algorithms and then compute the test loss. The test losses are given by $l_t^{(s)}$, where $l_t^{(s)}\triangleq(y_t^{(s)}-\hat{y}_t^{(s)})^{2}$ and $t$ ranges from 1 up to $t_{max}$, the latter being the longest time duration among all $y_t^{(s)}$ for $s=1,\ldots,100$. For series with shorter durations, we pad the loss sequences with zeros at the end to ensure consistency.
To eliminate any bias towards a specific sequence, we compute the average of these loss sequences, yielding the final averaged squared error: $l_t^{(ave)}=\frac{1}{100}\sum_{s=1}^{100}l_t^{(s)}$. For further refinement, we also average over time, resulting in $l_t^{(ave2)}=\frac{1}{t}\sum_{j=1}^{t}l_j^{(ave)}$. The outcomes for the LightGBM-based algorithms are depicted in Fig. 1, while those for the XGBoost-based algorithms are illustrated in Fig. 2.
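
The two averaging steps can be sketched with NumPy as follows. The function name `averaged_losses` is hypothetical; the sketch takes one list of true values and one list of predictions per series, pads the squared-error sequences with zeros up to the longest series, and then averages first over series and then cumulatively over time.

```python
import numpy as np

def averaged_losses(y_true_list, y_pred_list):
    """Return (l_ave, l_ave2) for test series of possibly different lengths."""
    t_max = max(len(y) for y in y_true_list)
    losses = np.zeros((len(y_true_list), t_max))      # zero padding at the end
    for s, (y, y_hat) in enumerate(zip(y_true_list, y_pred_list)):
        err = (np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2
        losses[s, :len(err)] = err                    # squared error l_t^(s)
    l_ave = losses.mean(axis=0)                       # average over the series
    l_ave2 = np.cumsum(l_ave) / np.arange(1, t_max + 1)   # running average over time
    return l_ave, l_ave2
```

Because shorter series contribute zeros after they end, the late entries of the averaged sequence are pulled toward zero whenever series lengths differ widely.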

Figure 1: Comparison of averaged loss over time, $l_t^{(ave2)}$, for the experiments on M4 Competition data with LightGBM-based algorithms.

Figure 2: Comparison of averaged loss over time for the experiments on M4 Competition data with XGBoost-based algorithms.

Additionally, Table 1 offers a detailed comparison of our algorithm against the standard LightGBM, XGBoost, and their respective implementations with Cross-Correlation, Mutual Information, and RFE algorithms.

| Algorithm \ Base Model | LightGBM | XGBoost |
|---|---|---|
| Cross-Correlation Algorithm | $4.7895\times 10^{-2}$ | $9.1995\times 10^{-2}$ |
| Mutual Information Algorithm | $4.6626\times 10^{-2}$ | $9.0114\times 10^{-2}$ |
| RFE Algorithm | $4.5212\times 10^{-2}$ | $8.8511\times 10^{-2}$ |
| Vanilla Algorithm | $4.7691\times 10^{-2}$ | $8.9000\times 10^{-2}$ |
| AFS-BM Algorithm | $\bm{3.8251\times 10^{-2}}$ | $\bm{5.4094\times 10^{-2}}$ |

Table 1: Final MSE Results of the Experiments on M4 Competition Daily Dataset with GBM-based Learners

Moreover, we apply the MLP model and other algorithms to the test data for each series, resulting in the loss sequences. The performance trends of the algorithms over time are visualized in Figure 3, which provides a graphical representation of the averaged loss sequences for the experiments on the M4 Competition data with MLP-based algorithms.

Figure 3: Comparison of averaged loss over time for the experiments on M4 Competition data with MLP-based algorithms.

Additionally, Table 2 offers a detailed comparison of our algorithm against the standard MLP and its implementations with the Cross-Correlation and Mutual Information algorithms.

| Algorithm \ Base Model | MLP |
|---|---|
| Cross-Correlation Algorithm | $4.8052\times 10^{-2}$ |
| Mutual Information Algorithm | $4.6781\times 10^{-2}$ |
| Vanilla Algorithm | $4.7845\times 10^{-2}$ |
| AFS-BM Algorithm | $\bm{3.8385\times 10^{-2}}$ |

Table 2: Final MSE Results of the Experiments on M4 Competition Daily Dataset with MLP

Time Series Feature Engineering and Algorithmic Evaluation on the Istanbul Stock Exchange Dataset

To forecast the target value $y_t$, we turn to the Istanbul Stock Exchange Dataset [23], which provides hourly data points. Initially, this dataset offers only the target values, denoted as $y_t$. To enrich the predictive capability of our model, we perform a systematic feature engineering process. Firstly, we extract immediate temporal dependencies by considering the three most recent lags of $y_t$, represented as $y_{t-1}$, $y_{t-2}$, and $y_{t-3}$. Recognizing the significance of longer-term patterns in hourly data, especially over the course of a day or multiple days, we also incorporate the 24th, 36th, and 48th lags, denoted as $y_{t-24}$, $y_{t-36}$, and $y_{t-48}$, respectively. In addition to these lags, we also factor in the cyclical nature of time by taking the sine and cosine values of the timestamp components, namely the month, day, and hour. This helps in capturing the cyclic patterns associated with daily and monthly rhythms.
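
The sine and cosine encoding of the timestamp can be sketched as follows. The function name `time_features`, the feature names, and the chosen periods (12 months, 31 days, 24 hours, with month and day shifted to start at zero) are assumptions for illustration; the text does not specify them.

```python
import math
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Encode month, day, and hour of a timestamp as points on the unit circle."""
    feats = {}
    components = (
        ("month", ts.month - 1, 12),   # assumed period: 12 months
        ("day", ts.day - 1, 31),       # assumed period: 31 days
        ("hour", ts.hour, 24),         # assumed period: 24 hours
    )
    for name, value, period in components:
        angle = 2.0 * math.pi * value / period
        feats[f"{name}_sin"] = math.sin(angle)
        feats[f"{name}_cos"] = math.cos(angle)
    return feats
```

Using both the sine and the cosine places hour 23 next to hour 0 on the circle, which a raw integer encoding of the hour would not do.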

In addition to these features, we also calculate the rolling statistics to capture more nuanced patterns. Specifically, we calculate the rolling mean and rolling standard deviation for both recent ($y_{t-1}$) and daily ($y_{t-24}$) lags. These calculations are performed over window sizes of 4 (representing a shorter part of the day), 12 (half a day), and 24 (a full day) hours. Hence, we get a comprehensive set of features from the time series data.

Following mean removal, the dataset is normalized to $[-1,1]$. The last 10% of the total samples are designated for testing. Furthermore, the 20% of the samples leading up to the test set are reserved for mask validation, while an additional 20% preceding the mask validation set is allocated for model validation. LightGBM, XGBoost, and MLP are also evaluated in their standard configurations without any feature selection method. Our algorithm is compared against the same feature selection methods as in the M4 experiments. All hyperparameters associated with LightGBM, XGBoost, and MLP are optimized on the model validation set.

The Mean Squared Error (MSE) losses and the numbers of selected features for the LightGBM-based and XGBoost-based algorithms are displayed in Table 3 and Table 4, respectively.

| Algorithm \ Base Model | LightGBM | XGBoost |
|---|---|---|
| Cross-Correlation Algorithm | $2.5564\times 10^{-2}$ | $2.5876\times 10^{-2}$ |
| Mutual Information Algorithm | $2.8480\times 10^{-2}$ | $2.5536\times 10^{-2}$ |
| RFE Algorithm | $4.5903\times 10^{-3}$ | $5.8105\times 10^{-3}$ |
| Vanilla Algorithm | $4.3225\times 10^{-3}$ | $5.5866\times 10^{-3}$ |
| AFS-BM Algorithm | $\bm{3.8350\times 10^{-3}}$ | $\bm{3.9445\times 10^{-3}}$ |

Table 3: Final MSE Results of the Experiments on Istanbul Stock Exchange Hourly Dataset with GBM-based Learners

| Base Model | Method | Selected Features |
|---|---|---|
| LightGBM | AFS-BM | 2 |
| XGBoost | AFS-BM | 8 |
| LightGBM | Cross-Correlation | 66 |
| XGBoost | Cross-Correlation | 66 |
| LightGBM | Mutual Information | 97 |
| XGBoost | Mutual Information | 101 |
| LightGBM | RFE | 10 |
| XGBoost | RFE | 5 |

Table 4: The number of features selected from a total of 176 features by each feature selection method on the Istanbul Stock Exchange Hourly Dataset with GBM-based Learners

Furthermore, the dataset undergoes evaluation using the MLP model. MLP, with its layered architecture, is adept at capturing nonlinear relationships in the data, making it a suitable choice for such complex datasets. The results based on Mean Squared Error (MSE) loss for the MLP-based algorithms are presented in Table 5.

| Algorithm \ Base Model | MLP |
|---|---|
| Cross-Correlation Algorithm | $2.5652\times 10^{-2}$ |
| Mutual Information Algorithm | $2.8603\times 10^{-2}$ |
| Vanilla Algorithm | $4.3350\times 10^{-3}$ |
| AFS-BM Algorithm | $\bm{3.8423\times 10^{-3}}$ |

Table 5: Final MSE Results of the Experiments on Istanbul Stock Exchange Hourly Dataset with MLP

A comparative analysis reveals that the AFS-BM algorithm achieves the lowest MSE in every regression experiment reported in Tables 1, 2, 3, and 5. The main strength of AFS-BM is its ability to adapt its feature selection as data patterns change, which it does by checking the error at every step. This adaptability distinguishes it from fixed methods such as Cross-Correlation and Mutual Information. AFS-BM can effectively handle both short-term and long-term changes in time series data. The algorithm works at its best when combined with cross-validated hyperparameter selection. In summary, our experimental results demonstrate the superior performance of the AFS-BM algorithm over the state-of-the-art feature selection methods.

6 Conclusion

In this paper, we have addressed the problem of feature selection in general machine learning models, recognizing its critical role in both model accuracy and efficiency. Traditional feature selection methods struggle with significant challenges such as scalability, managing high-dimensional data, handling correlated features, adapting to shifting feature importance, and integrating domain knowledge. To this end, we introduced the “Adaptive Feature Selection with Binary Masking” (AFS-BM) algorithm.

AFS-BM differs from the current algorithms since it uses joint optimization for both dynamic and stochastic feature selection and model training, a method that involves binary masking. This approach enables the algorithm to adapt continuously by adjusting features and model parameters during training, responding to changes in feature importance as measured by the loss metric values. This adaptability ensures that AFS-BM retains essential features while discarding less relevant ones based on evaluation results, rather than relying solely on feature importance calculated using intuitive reasoning. This is crucial since depending solely on feature importance can lead to the removal of informative features: feature importance evaluates features within the context of the feature set, rather than considering their contribution to final model performance. AFS-BM’s approach prevents the removal of informative features, ultimately enhancing model accuracy. To encourage further research and allow others to replicate our results, we openly share our source code at https://github.com/YigitTurali/AFS_BM-Algorithm.
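
The loss-driven masking idea summarized above can be illustrated with a minimal sketch. This is not the released AFS-BM implementation: it assumes a plain least-squares base learner in place of the GBM and MLP models used in the experiments, a simple greedy removal order, and synthetic data. The names `val_mse`, `masked_selection`, and `tol` are hypothetical. A feature is masked out only if the validation loss does not get worse after refitting without it.

```python
import numpy as np


def val_mse(X_tr, y_tr, X_va, y_va, mask):
    """Refit least squares on the masked features and return validation MSE."""
    # Assumption: linear least squares stands in for the base model.
    w, *_ = np.linalg.lstsq(X_tr * mask, y_tr, rcond=None)
    pred = (X_va * mask) @ w
    return float(np.mean((y_va - pred) ** 2))


def masked_selection(X_tr, y_tr, X_va, y_va, tol=0.0):
    """Greedily zero out mask entries whose removal does not hurt validation loss."""
    mask = np.ones(X_tr.shape[1])
    best = val_mse(X_tr, y_tr, X_va, y_va, mask)
    changed = True
    while changed:
        changed = False
        for j in np.flatnonzero(mask):
            trial = mask.copy()
            trial[j] = 0.0  # tentatively mask feature j
            loss = val_mse(X_tr, y_tr, X_va, y_va, trial)
            if loss <= best + tol:  # keep the removal only if loss does not rise
                mask, best, changed = trial, loss, True
    return mask.astype(int), best


# Synthetic example: only the first two of six features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=400)
X_tr, X_va, y_tr, y_va = X[:300], X[300:], y[:300], y[300:]

full_loss = val_mse(X_tr, y_tr, X_va, y_va, np.ones(6))
mask, best = masked_selection(X_tr, y_tr, X_va, y_va)
print("mask:", mask, "validation MSE:", best, "full-feature MSE:", full_loss)
```

Because every removal is accepted only after re-evaluating the validation loss, the two informative features stay in the mask, while uninformative ones are dropped whenever doing so does not increase the error.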

Statements and Declarations

  • Competing interests: The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

  • Availability of data and materials: The data that support the findings of this study are openly available in the UCI Machine Learning Repository at https://archive.ics.uci.edu.

References

  • Capobianco [2022] Capobianco, E. High-dimensional role of AI and machine learning in cancer research. British Journal of Cancer 2022, 126, 523–532.
  • Bellman [1961] Bellman, R. E. Adaptive Control Processes: A Guided Tour; Princeton University Press: Princeton, 1961.
  • Bishop [2006] Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer-Verlag: Berlin, Heidelberg, 2006.
  • Farmanbar and Toygar [2016] Farmanbar, M.; Toygar, Ö. Feature selection for the fusion of face and palmprint biometrics. Signal, Image and Video Processing 2016, 10, 951–958.
  • Guyon and Elisseeff [2003] Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 2003, 3, 1157–1182.
  • Lajevardi and Hussain [2012] Lajevardi, S. M.; Hussain, Z. M. Automatic facial expression recognition: feature extraction and selection. Signal, Image and Video Processing 2012, 6, 159–169.
  • Peng et al. [2005] Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005, 27, 1226–1238.
  • Kira and Rendell [1992] Kira, K.; Rendell, L. A. A Practical Approach to Feature Selection. In Machine Learning Proceedings 1992; Sleeman, D., Edwards, P., Eds.; Morgan Kaufmann: San Francisco (CA), 1992; pp 249–256.
  • Rida et al. [2016] Rida, I.; Almaadeed, S.; Bouridane, A. Gait recognition based on modified phase-only correlation. Signal, Image and Video Processing 2016, 10, 463–470.
  • Tibshirani [1996] Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996, 58, 267–288.
  • Breiman [2001] Breiman, L. Random Forests. Machine Learning 2001, 45, 5–32.
  • Saeys et al. [2007] Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23, 2507–2517.
  • Atan et al. [2019] Atan, O.; Zame, W. R.; Feng, Q.; van der Schaar, M. Constructing effective personalized policies using counterfactual inference from biased data sets with many features. Machine Learning 2019, 108, 945–970.
  • Subasi [2015] Subasi, A. A decision support system for diagnosis of neuromuscular disorders using DWT and evolutionary support vector machines. Signal, Image and Video Processing 2015, 9, 399–408.
  • Ruszczak et al. [2024] Ruszczak, B.; Smykała, K.; Tomaszewski, M.; Navarro Lorente, P. J. Various tomato infection discrimination using spectroscopy. Signal, Image and Video Processing 2024,
  • Chen and Guestrin [2016] Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016; pp 785–794.
  • Goodfellow et al. [2016] Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press, 2016; http://www.deeplearningbook.org.
  • Junejo et al. [2013] Junejo, I. N.; Bhutta, A. A.; Foroosh, H. Single-class SVM for dynamic scene modeling. Signal, Image and Video Processing 2013, 7, 45–52.
  • Aguiar et al. [2023] Aguiar, G.; Krawczyk, B.; Cano, A. A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework. Machine Learning 2023,
  • Zhang et al. [2022] Zhang, N.; Chen, M.; Yang, F.; Yang, C.; Yang, P.; Gao, Y.; Shang, Y.; Peng, D. Forest Height Mapping Using Feature Selection and Machine Learning by Integrating Multi-Source Satellite Data in Baoding City, North China. Remote Sensing 2022, 14.
  • Sun and Dai [2018] Sun, C.; Dai, R. Distributed Optimization for Convex Mixed-Integer Programs based on Projected Subgradient Algorithm. 2018 IEEE Conference on Decision and Control (CDC). 2018; pp 2581–2586.
  • Makridakis et al. [2020] Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 2020, 36, 54–74.
  • Akbilgic [2013] Akbilgic, O. ISTANBUL STOCK EXCHANGE. UCI Machine Learning Repository, 2013; DOI: https://doi.org/10.24432/C54P4J.