AFS-BM: Enhancing Model Performance through Adaptive Feature Selection with Binary Masking

Mehmet Y. Turali, Mehmet E. Lorasdagi, Suleyman S. Kozat

Department of Electrical and Electronics Engineering, Bilkent University, Ankara, 06800, Turkey

These authors contributed equally to this work.

Abstract

We study the problem of feature selection in a general machine learning (ML) context, which is one of the most critical subjects in the field. Although many feature selection methods exist, they face challenges such as scalability, managing high-dimensional data, dealing with correlated features, adapting to variable feature importance, and integrating domain knowledge. To this end, we introduce “Adaptive Feature Selection with Binary Masking” (AFS-BM), which remedies these problems. AFS-BM achieves this through joint optimization for simultaneous feature selection and model training. In particular, we use joint optimization and binary masking to continuously adapt the set of features and model parameters during the training process. This approach leads to significant improvements in model accuracy and a reduction in computational requirements. We provide an extensive set of experiments in which we compare AFS-BM with established feature selection methods using well-known datasets from real-life competitions. Our results show that AFS-BM significantly improves accuracy and requires significantly less computation. This is due to AFS-BM’s ability to dynamically adjust to the changing importance of features during the training process, which is an important contribution to the field. We openly share our code for the replicability of our results and to facilitate further research.

keywords:

Machine Learning, Feature Selection, Gradient Boosting Machines, Adaptive Optimization, Binary mask, High-Dimensional Datasets

1 Introduction

The emergence of the digital era has enabled a significant increase in data across various domains, including genomics and finance. High-dimensional datasets are valuable in areas such as image processing, genomics, and finance due to the detailed information they provide [1]. These datasets can contribute to the development of highly accurate models. This influx of data, particularly in the form of high-dimensional feature sets, brings both opportunities and challenges for the machine learning domain [2]. High-dimensional data can reveal complex patterns, but it may also lead to the well-known “curse of dimensionality” [2]. The large number of features can sometimes hide true patterns, leading to models that require significant computational resources, are prone to overfitting, and are difficult to interpret [3].

While abundant features can provide rich information, not all features contribute equally to the final task. Some might be redundant, irrelevant, or even detrimental to the performance of the model if there is not enough data to learn from [4]. The challenge lies in identifying which features are essential and which can be discarded without compromising the accuracy of the model with limited data [5, 6]. This problem of feature selection is crucial for both regression and classification tasks; as an example, this happens when the feature dimensions are comparable to the training dataset size or for non-stationary environments, i.e., when we do not have enough data to learn which features are relevant due to changing statistics.

Here, we introduce “Adaptive Feature Selection with Binary Masking” (AFS-BM), a novel method designed for feature selection in high-dimensional datasets and scenarios characterized by non-stationarity. This method is particularly adept at handling cases with insufficient data for reliably identifying relevant features, a challenge it addresses through the integration of a binary mask. This binary mask, represented as a column vector with “1”s for active features and “0”s for inactive ones, plays a pivotal role in our joint optimization framework, which combines feature selection and model training, normally an NP-hard optimization problem. AFS-BM performs this by adapting the binary mask concurrently with the model parameter optimization. This joint and dynamic optimization enables us to fine-tune the feature set while training the model, resulting in a more agile and efficient identification of relevant features. Consequently, it significantly elevates the predictive accuracy of the underlying model. Our comprehensive experimentation across renowned competition datasets demonstrates the robustness and effectiveness of our method. Compared to widely used feature selection techniques, AFS-BM consistently yields substantial performance improvements. Furthermore, we also openly share our code (https://github.com/YigitTurali/AFS_BM-Algorithm) to facilitate further research and replicability of our results.

Our main contributions to the literature can be summarized as follows:

• We introduce AFS-BM, which uniquely combines binary mask feature selection with joint optimization. This integration allows for dynamic feature selection, actively improving model accuracy by focusing on the most relevant features and simultaneously reducing noise by eliminating less significant ones.

• This binary mask dynamically refines the feature set, playing a crucial role in improving the computational efficiency of the approach.

• Our algorithm ensures a critical balance between feature selection and maintaining model accuracy by using a well-tuned and iterative approach that evaluates and prunes features based on their final impact on performance, thereby ensuring efficiency without compromising predictive precision.

• We demonstrate significant performance improvements over traditional feature selection techniques, particularly when AFS-BM is applied to GBMs and NNs.

The rest of the paper is organized as follows: Section 2 outlines the current literature on filter, wrapper, embedded and adaptive feature selection methods. Section 3 presents the mathematical background and problem description with current feature selection methods. Section 4 introduces our novel feature selection structure, detailing the algorithms and their underlying principles. Section 5 showcases our experimental results, comparing our approach with established feature selection techniques. Finally, Section 6 offers a summary of our findings and conclusions.

2 Related Work

Several established methods for feature selection exist. Filter methods such as the Chi-squared test, mutual information [7], and correlation coefficients assess features based on their inherent statistical characteristics [8]. Wrapper methods, including techniques such as sequential forward selection and sequential backward elimination, employ specific machine learning algorithms to evaluate varying feature subsets [9]. Embedded methods, represented by techniques such as LASSO and decision tree-based approaches, integrate feature selection directly into the model training process [10, 11]. However, these methods often depend on static analysis and predetermined algorithms, which can be a limitation. Specifically, wrapper and embedded approaches may require significant computational resources, making them challenging to apply to large datasets [12]. While widely used, these methods often struggle with high-dimensional datasets and lack the flexibility to adapt to the evolving requirements of complex models [13]. In contrast, our method employs a binary mask coupled with iterative optimization, enabling dynamic adaptation to the learning patterns of a model. Since this approach jointly and dynamically adjusts to the changing importance of features during the training process, it enhances the precision of feature selection and ensures better performance in challenging scenarios involving high-dimensional or evolving datasets. Additionally, in the context of evolving datasets, where data patterns and relationships may shift over time, the flexibility of the binary mask to adapt to these changes ensures that the model remains robust and relevant. It can adjust to new patterns or discard previously relevant features that become obsolete, maintaining its efficacy in dynamically changing environments. This adaptability is crucial for maintaining high performance over time, especially in real-world applications where data is not static and can evolve.

In recent years, there has been a shift towards adaptive feature selection methods. These methods focus on adapting to the learning behavior of the model. An example of this is the Recursive Feature Elimination (RFE) for Support Vector Machines (SVMs), where features are prioritized based on their influence on the SVM’s decision-making [14, 15]. Tree-based models, such as Random Forests and GBMs, also provide feature importance metrics derived from their tree structures [16]. These existing methods, while adaptive, are not always optimal for high-dimensional and non-stationary data. They are also designed mainly for traditional machine learning models and might not be optimal for different architectures such as NNs [17]. In contrast, AFS-BM demonstrates superior adaptability and efficiency in handling such challenges. Its iterative refinement of the feature set through a binary mask offers more precise feature selection, making it a more versatile and robust solution, because it continuously evaluates and adjusts which features are most predictive, thus improving the focus of the model on truly relevant data for NNs and other advanced machine learning algorithms.

3 Problem Statement

Vectors in this manuscript are represented as column vectors using bold lowercase notation, while matrices are denoted using bold uppercase letters. For a given vector $\bm{x}$ and a matrix $\bm{X}$ , their respective transposes are represented as $\bm{x}^{T}$ and $\bm{X}^{T}$ . The symbol $\odot$ denotes the Hadamard product, which signifies an element-wise multiplication operation between matrices. For any vector $\bm{x}$ , $x_{i}$ represents the $i^{th}$ element. For a matrix $\bm{X}$ , $X_{ij}$ indicates the element in the $i^{th}$ row and $j^{th}$ column. The operation $\sum(\cdot)$ calculates the sum of the elements of a vector or matrix. The L1 norm of a vector $\bm{x}$ is defined as $||\bm{x}||_{1}=\sum_{i}|x_{i}|,$ where the summation runs over all elements of the vector. The number of elements in set $S$ is given by $|S|$ .

We address the feature selection problem in large datasets and explore the challenges associated with time series forecasting and classification tasks. We then examine GBMs and RFEs, concluding with a discussion of widely used feature selection methods in the machine learning literature, such as cross-correlation and mutual information.

We study adaptive feature selection in the context of online learning for prediction/forecasting/classification of non-stationary data, where we adaptively learn the most relevant features. We are given a vector sequence $\bm{x}_{1:T}=\{\bm{x}_{t}\}_{t=1}^{T}$ , where $T$ denotes the sequence length and $\bm{x}_{t}\in\mathbb{R}^{M}$ is the feature vector at time $t$ . This input can include features from the target sequence (endogenous) and auxiliary (exogenous) features such as weather or time of the day. The corresponding target output for $\bm{x}_{1:T}$ is $\bm{y}_{1:T}=\{\bm{y}_{t}\}_{t=1}^{T}$ , with $\bm{y}_{t}\in\mathbb{R}^{C}$ being the desired output vector at time $t$ , where $C$ represents the number of components or dimensions in the desired output vector.

In the online learning setting, the goal is to estimate $\bm{y}_{t}$ using only the inputs observed up to time $t$ as $\bm{\hat{y}}_{t}=f_{t}(\bm{y}_{1},\ldots,\bm{y}_{t-1},\bm{x}_{1},\ldots,\bm{x}_{t};\bm{\theta}_{t})$ , where $f_{t}$ is a dynamic nonlinear function parameterized by $\bm{\theta}_{t}$ . After each observation of $\bm{y}_{t}$ , we compute a loss $\mathcal{L}(\bm{y}_{t},\bm{\hat{y}}_{t})$ and update the algorithm parameters in real-time. As an example, in this paper a loss such as the mean squared error (MSE) is computed over the sequence:

\mathcal{L}_{\text{MSE}}=\frac{1}{T}\sum_{t=1}^{T}\bm{e}_{t}^{T}\bm{e}_{t},

(1)

where $\bm{e}_{t}=\bm{y}_{t}-\bm{\hat{y}}_{t}$ denotes the error vector at time $t$ . Other metrics, such as the mean absolute error (MAE), can also be considered since our method is generic and extends to other loss functions.
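As a small illustration of (1), the sketch below computes the sequence MSE for vector-valued targets. The function name `sequence_mse` and the toy values are assumptions made for this example and are not taken from the authors' code.

```python
def sequence_mse(y_true, y_pred):
    """Mean over time of e_t^T e_t, with e_t = y_t - y_hat_t, as in (1)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for y_t, y_hat_t in zip(y_true, y_pred):
        # e_t^T e_t is the squared Euclidean norm of the error vector.
        total += sum((a - b) ** 2 for a, b in zip(y_t, y_hat_t))
    return total / len(y_true)


# Toy sequence with T = 2 and C = 2 (illustrative values only).
y = [[1.0, 2.0], [3.0, 4.0]]
y_hat = [[1.0, 1.0], [2.0, 4.0]]
mse = sequence_mse(y, y_hat)  # each time step contributes a squared error of 1
```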

Additionally, we address adaptive feature selection for classification tasks. Given a set of feature vectors $\bm{X}=[\bm{x}_{1},\ldots,\bm{x}_{N}]^{T}$ , where $N$ is the total number of samples and $\bm{x}_{i}\in\mathbb{R}^{M}$ is the feature vector for the $i^{th}$ sample, this input can encompass both primary features and auxiliary information. For the dataset $\bm{X}$ , the corresponding target labels are given by $\bm{Y}=[\bm{y}_{1},\ldots,\bm{y}_{N}]^{T}$ . Here, $\bm{y}_{i}\in\mathbb{R}^{|C|}$ , where each class is denoted by $c$ , which belongs to the set $C$ , and $y_{i,c}=p$ represents the true probability of class $c$ for the $i^{th}$ sample. The goal is to predict the class label $\hat{y}_{i}=\operatorname*{arg\,max}_{c}\hat{y}_{i,c}$ using the feature vector $\bm{x}_{i}$ as $\bm{\hat{y}}_{i}=f(\bm{x}_{i};\bm{\theta})$ , where $f$ is a nonlinear classifier function parameterized by $\bm{\theta}$ . The predicted label probabilities are given by $\bm{\hat{Y}}=[\bm{\hat{y}}_{1},\ldots,\bm{\hat{y}}_{N}]^{T}$ , where $\bm{\hat{y}}_{i}\in[0,1]^{|C|}$ is the probability vector over all class labels for the $i^{th}$ sample. Upon observing the true probability vector of labels $\bm{y}_{i}$ , we incur a loss $\mathcal{L}(\bm{y}_{i},\bm{\hat{y}}_{i})$ and adjust the model based on this loss. The performance of the model is assessed by classification accuracy or other relevant metrics. A commonly used metric is the cross-entropy loss $\mathcal{L}_{\text{CE}}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c\in C}y_{i,c}\log(\hat{y}_{i,c})$ , where ${y}_{i,c}$ is the true probability of class $c$ for the $i^{th}$ sample, and $\hat{{y}}_{i,c}$ is the predicted probability. Other metrics, such as the F1-score or Area Under the Curve (AUC), can also be used.
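The cross-entropy loss and the arg max prediction rule can be illustrated with a few lines of Python. This is a minimal sketch: the helper names and the small `eps` guard against $\log(0)$ are assumptions added for the example, not part of the formulation above.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Average cross-entropy over samples; eps avoids log(0) (added safeguard)."""
    total = 0.0
    for y_i, p_i in zip(y_true, y_prob):
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(y_i, p_i))
    return total / len(y_true)


def predict_label(p_i):
    """Index of the class with the largest predicted probability."""
    return max(range(len(p_i)), key=lambda c: p_i[c])


# Two samples, two classes, one-hot true probabilities (toy values).
y = [[1.0, 0.0], [0.0, 1.0]]
p = [[0.8, 0.2], [0.25, 0.75]]
loss = cross_entropy(y, p)
labels = [predict_label(p_i) for p_i in p]
```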

As the feature space expands, the risk of overfitting increases, especially when the number of features $M$ approaches the number of samples $N$ [2], a phenomenon often referred to as the “Curse of Dimensionality”. Non-stationarity compounds this difficulty. In regression, for example, it can arise from evolving relationships between variables [18]. In classification, it can manifest as shifting class boundaries due to changing class distributions [19].

Among the various feature selection techniques developed, some of the most effective and widely used methods include GBMs for feature importance [20], RFE, and Greedy Methods. These methods provide robust frameworks for identifying significant features within large datasets. Building upon these foundational techniques, we propose a novel method described in the following sections.

4 Adaptive Feature Selection with Binary Masking

Our method, AFS-BM, continually refines its choice of features by utilizing a binary mask. We first introduce the “Model Optimization Phase” (4.1), detailing how features are initially selected and utilized for model training. We then continue to the “Masked Optimization & Feature Selection Phase” (4.2), where we describe the process of refining the feature set and optimizing the binary mask. The combined approach ensures a systematic and adaptive feature selection for improved model performance.

Consider a dataset at the start of the algorithm, which is denoted as $\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{N+P}$ , where $N+P$ is the number of samples in the dataset, since we later split the dataset into two parts. Here, $\bm{x}_{i}\in\mathbb{R}^{M}$ represents the feature vector for sample $i$ , and $y_{i}$ is the corresponding target value in the dataset $\mathcal{D}$ . The set of feature vectors is defined as $\bm{X}=[\bm{x}_{1},\ldots,\bm{x}_{N}]^{T}$ and the target vector is defined as $\bm{y}=[{y}_{1},\ldots,{y}_{N}]^{T}$ . To extract important features and eliminate redundant ones, we define a binary mask $\bm{z}\in\{0,1\}^{M}$ , where $M$ is the total number of features. The binary nature ensures that a feature is either selected (1) or not (0). The set of feature vectors $\bm{X}$ is modified by the mask through the Hadamard product, applied to each feature vector (row of $\bm{X}$ ), such as:

\bm{X}_{modified}=\bm{X}\odot\bm{z}.

(2)
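To make (2) concrete, the following sketch applies a binary mask to every row of a small feature matrix, so that masked-out features become zero. The function name `apply_mask` and the toy data are hypothetical and only serve as an illustration.

```python
def apply_mask(X, z):
    """Multiply every row of X element-wise by the binary mask z."""
    if any(v not in (0, 1) for v in z):
        raise ValueError("mask entries must be 0 or 1")
    if any(len(row) != len(z) for row in X):
        raise ValueError("each row of X must have len(z) features")
    return [[x * m for x, m in zip(row, z)] for row in X]


X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
z = [1, 0, 1]  # the second feature is switched off
X_modified = apply_mask(X, z)
```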

Moreover, the binary mask $\bm{z}$ can be seen as a constraint on the feature space. Therefore, the primary objective is to minimize the loss function while optimizing this mask, so that the model closely approximates the true target values while using as few features as possible. This problem can be represented as:

\min_{\bm{z},\bm{\theta}}\mathcal{L}(\bm{y},F(\bm{X}\odot\bm{z},\bm{\theta}))+\frac{||\bm{z}||_{1}}{M},

(3)

subject to $\bm{z}\in\{0,1\}^{M}$ . Note that (3) does not have a closed-form solution since it includes integer optimization [21]. Hence, we propose an iterative algorithm that overcomes the challenges posed by the integer optimization in (3). Our algorithm leverages subgradient methods [21] and introduces a relaxation technique for the binary constraints on $\bm{z}$ . In each iteration, the algorithm refines the approximation of $\bm{z}$ and updates $\bm{\theta}$ based on the current estimate. The iterative process continues until convergence, i.e., when the change in the objective function value between consecutive iterations falls below a predefined threshold.
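The sketch below evaluates the objective in (3) for a given mask and, for a tiny $M$ , finds the best mask by exhaustive search. It is only meant to show why the search space of $2^{M}$ masks motivates an iterative algorithm. Several simplifications are assumptions of this example: the model parameters are held fixed instead of being optimized jointly, the loss is the MSE, and the predictor `predict` is hypothetical.

```python
from itertools import product


def objective(X, y, z, predict):
    """Loss on masked inputs plus the sparsity term ||z||_1 / M from (3)."""
    M = len(z)
    preds = [predict([x * m for x, m in zip(row, z)]) for row in X]
    mse = sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)
    return mse + sum(z) / M


# Hypothetical fixed predictor in which only the first feature matters.
def predict(x):
    return 2.0 * x[0]


X = [[1.0, 5.0, -3.0], [2.0, 1.0, 4.0], [3.0, -2.0, 0.5]]
y = [2.0, 4.0, 6.0]

# Exhaustive search visits all 2^M masks, which is feasible only for tiny M.
best_z = min(product((0, 1), repeat=3),
             key=lambda z: objective(X, y, z, predict))
```

In this toy setting the search keeps only the first feature, since any additional active feature increases the penalty term without reducing the loss.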

As a solution to the optimization problem presented in (3), our algorithm AFS-BM, detailed in Algorithm 1, iteratively performs both binary mask and loss optimization. The binary mask acts as a dynamic filter, allowing the algorithm to actively select or discard features during the learning process. This ensures that only the most relevant features are used, thereby improving the accuracy of the model. The iterative optimization refines both the parameters of the model and the binary mask, ensuring that the feature selection process remains aligned with the learning objectives of the model. The algorithm continues its iterative process until the binary mask remains unchanged between iteration cycles, providing a robust and efficient solution to the problem.

The AFS-BM algorithm begins by initializing the slack thresholds $\mu$ and $\beta$ , which are set based on cross-validation results, as well as a positive real number $\Delta\mathcal{L}$ , in line 1 of Algorithm 1. At the start of the algorithm, specifically at iteration $k=0$ , two distinct datasets are used. The first dataset, denoted by $\mathcal{D}^{(0)}$ , is used for model optimization and is defined as $\mathcal{D}^{(0)}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{N}$ . The second dataset, represented by $\mathcal{D}_{m}^{(0)}$ , is employed for the masked feature selection process and is given by $\mathcal{D}_{m}^{(0)}=\{(\bm{x}_{i}^{(m)},y_{i}^{(m)})\}_{i=1}^{P}$ . The definitions of both $\mathcal{D}^{(0)}$ and $\mathcal{D}_{m}^{(0)}$ are given in line 2. In the notation, the superscript in $\mathcal{D}^{(0)}$ and $\mathcal{D}_{m}^{(0)}$ indicates the iteration number, which in this case is the initial iteration, i.e., 0, and the subscript “ $m$ ” in $\mathcal{D}_{m}$ signifies that this dataset is specifically used for “masked” feature selection. Here, $N$ and $P$ are the numbers of samples in the datasets. $\bm{x}_{i},\bm{x}_{i}^{(m)}\in\mathbb{R}^{M}$ represent the feature vectors for sample $i$ , and $y_{i},y_{i}^{(m)}$ are the corresponding target values in the model optimization and masked feature selection datasets, respectively. The sets of feature vectors are defined as $\bm{X}^{(0)}=[\bm{x}_{1},\ldots,\bm{x}_{N}]^{T}$ and $\bm{X}^{(0)}_{m}=[\bm{x}_{1}^{(m)},\ldots,\bm{x}_{P}^{(m)}]^{T}$ in line 3. The process begins with the initialization of the binary mask vector in line 4, $\bm{z}^{(0)}\in\{0,1\}^{M}$ , with all entries set to one, indicating the inclusion of all features. Our goal is to train a model $F_{k}(\bm{x}_{i},\bm{\theta})$ and optimize a binary mask $\bm{z}$ accordingly, so that the mask selects the best features at the $k$ -th iteration.

4.1 Model Optimization Phase of AFS-BM

In this phase, the focus is on using a binary mask from a prior iteration to identify the feature subset, which subsequently results in the formation of a masked dataset. The main objective is to fine-tune the model by reducing a given loss function while keeping the mask unchanged. After this optimization, a test loss is calculated on a separate masked dataset and serves as a benchmark loss. This threshold enables the upcoming feature selection phase, ensuring the adaptive feature extraction.

The main loop of the algorithm, which starts at line 5, continues as long as $\beta\neq 0$ . Within this loop, before the $k$ -th feature selection iteration, the feature subset is determined by the binary mask $\bm{z}^{(k-1)}$ , which is optimal at iteration $k-1$ with respect to the minimization of the loss function from the previous iteration. This means that the mask $\bm{z}^{(k-1)}$ was found to provide the feature subset that resulted in the lowest loss for the model during iteration $k-1$ . First, at line 6, the algorithm initializes a model $F_{k}(\bm{X}^{(k)},\bm{\theta})$ . Then, in line 7, the mask $\bm{z}^{(k-1)}$ is applied to each feature vector to generate $\bm{x}_{i}^{masked}=\bm{x}_{i}\odot\bm{z}^{(k-1)}$ for model training, which results in the dataset $\mathcal{D}^{(k)}=\{(\bm{x}_{i}^{masked},y_{i})\}_{i=1}^{N}$ in line 8. Hence, $\bm{X}^{(k)}=[\bm{x}^{masked}_{1},\ldots,\bm{x}^{masked}_{N}]^{T}$ , where $\bm{x}^{masked}_{i}\in\mathbb{R}^{M}$ is the feature vector for the $i^{th}$ sample. Training involves minimizing the loss function $\mathcal{L}(\bm{y},F_{k}(\bm{X}^{(k)},\bm{\theta}))$ with respect to the parameters of the model, $\bm{\theta}$ , since the mask $\bm{z}^{(k-1)}$ is held constant at this step, as shown in line 10:

\bm{\hat{\theta}}_{k}=\operatorname*{arg\,min}_{\bm{\theta}}\mathcal{L}(\bm{y},F_{k}(\bm{X}^{(k)},\bm{\theta})).

(4)

After optimizing the model, we determine the test loss value using the current best parameters, denoted as $\bm{\hat{\theta}}_{k}$ . This loss value is then defined as $\mathcal{L}_{th}$ and will serve as a reference point or threshold for the subsequent phase of the algorithm. To calculate $\mathcal{L}_{th}$ , we use the masked feature selection dataset $\mathcal{D}_{m}$ as a test set for the model optimization phase of the AFS-BM algorithm. First, we mask the feature vector with the most recent optimal mask $\bm{z}^{(k-1)}$ from the previous iteration and generate ${\bm{x}_{i}^{(m)}}^{masked}$ . Then, we update $\mathcal{D}_{m}^{(k-1)}$ to calculate the $\mathcal{L}_{th}$ from line 11 to 14:

		$\displaystyle{\bm{x}_{i}^{(m)}}^{masked}=\bm{x}_{i}^{(m)}\odot\bm{z}^{(k-1)},$		(5)
		$\displaystyle\mathcal{D}_{m}^{(k-1)}=\{({\bm{x}_{i}^{(m)}}^{masked},y_{i}^{(m)})\}_{i=1}^{P},$		(6)
		$\displaystyle\bm{X}_{m}^{(k-1)}=[{\bm{x}_{1}^{(m)}}^{masked},\ldots,{\bm{x}_{P}^{(m)}}^{masked}]^{T},$		(7)
		$\displaystyle\mathcal{L}_{th}=\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{X}_{m}^{(k-1)},\bm{\hat{\theta}}_{k})),$		(8)

where we define a test set of feature vectors at step $(k-1)$ as $\bm{X}_{m}^{(k-1)}=[{\bm{x}_{1}^{(m)}}^{masked},\ldots,{\bm{x}_{P}^{(m)}}^{masked}]^{T}$ , where ${\bm{x}_{i}^{(m)}}^{masked}\in\mathbb{R}^{M}$ is the masked feature vector for the $i^{th}$ sample of the masked feature selection dataset. After calculating the threshold loss $\mathcal{L}_{th}$ , we proceed with the masked feature selection phase.
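The model optimization phase in (4) to (8) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: a linear model fitted by batch gradient descent stands in for $F_{k}$ (the paper uses GBMs and NNs), the loss is the MSE, and all function names, the learning rate, and the toy data are hypothetical.

```python
def mask_rows(X, z):
    """Apply the binary mask z to every feature vector (row) of X."""
    return [[x * m for x, m in zip(row, z)] for row in X]


def fit_linear(X, y, lr=0.05, epochs=2000):
    """Least-squares fit by batch gradient descent (stand-in for F_k)."""
    n, M = len(X), len(X[0])
    theta = [0.0] * M
    for _ in range(epochs):
        grad = [0.0] * M
        for row, t in zip(X, y):
            err = sum(w * x for w, x in zip(theta, row)) - t
            for j in range(M):
                grad[j] += 2.0 * err * row[j] / n
        theta = [w - lr * g for w, g in zip(theta, grad)]
    return theta


def mse(X, y, theta):
    preds = [sum(w * x for w, x in zip(theta, row)) for row in X]
    return sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)


def model_optimization_phase(X_train, y_train, X_sel, y_sel, z_prev):
    """Train with the mask held fixed, then compute the threshold loss."""
    theta_hat = fit_linear(mask_rows(X_train, z_prev), y_train)  # eq. (4)
    loss_th = mse(mask_rows(X_sel, z_prev), y_sel, theta_hat)    # eq. (8)
    return theta_hat, loss_th


# Toy data in which the target depends only on the first feature.
X_train = [[1.0, 0.5], [2.0, -1.0], [0.0, 1.0], [-1.0, 2.0]]
y_train = [3.0, 6.0, 0.0, -3.0]
X_sel = [[1.5, 0.3], [-0.5, 1.0]]
y_sel = [4.5, -1.5]
theta_hat, loss_th = model_optimization_phase(
    X_train, y_train, X_sel, y_sel, z_prev=[1, 1])
```

Keeping the masked selection set separate from the training set means that the threshold loss reflects held-out performance, which is what the subsequent mask optimization compares against.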

4.2 Masked Optimization & Feature Selection Phase of AFS-BM

In this phase, the AFS-BM algorithm undergoes a thorough feature extraction via mask optimization after training. It sets a slack variable to guide the optimization duration and uses a temporal mask to evaluate feature relevance. By iteratively masking out features and observing the impact on the loss of the model, the algorithm differentiates between essential and redundant features. This iterative process continues until the feature set stabilizes, ensuring the model is equipped with the most significant features for the minimum loss on the given dataset.

After training, the mask optimization and feature selection process starts. In this process, we use the slack variable $\mu\in\mathbb{Z}^{+}$ to stop the mask optimization. It is initialized in line 1, adjusted based on cross-validation, and only affects the computation time of the algorithm. The mask optimization phase starts with initializing a temporal mask $\bm{\hat{z}}^{(k-1)}$ that copies the binary mask from the previous iteration. We define the temporal mask $\bm{\hat{z}}^{(k-1)}$ at line 15 as:

\bm{\hat{z}}^{(k-1)}\leftarrow\bm{z}^{(k-1)}.

(9)

As long as $\mu\neq 0$ , which is checked at line 17, we continue with the index selection procedure. During this phase, the algorithm randomly selects a unique index $i$ from a set of available indices $\mathcal{S}$ in lines 18 to 19. This index selection procedure is described as follows:

Let $\mathcal{S}={\{1,\ldots,M\}}$ be a set of indices with $|\mathcal{S}|=M$ elements. An index $i\in\mathcal{S}$ is selected according to a uniform distribution over $\mathcal{S}$ . This probabilistic mechanism enhances the adaptiveness and diversity of the feature selection process. Once chosen, $i$ cannot be reselected, preserving the uniqueness of the selection. This is expressed as $i\in\mathcal{S},\quad i\notin\{j\mid j\text{ has been previously selected}\}$ .
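A minimal sketch of this index selection, assuming Python's standard `random` module and a hypothetical helper `draw_index`, is given below. Each index is drawn uniformly from the remaining pool and then removed, so it cannot be selected twice.

```python
import random


def draw_index(available, rng):
    """Pick one index uniformly at random and remove it from the pool."""
    i = rng.choice(sorted(available))
    available.remove(i)
    return i


rng = random.Random(0)   # the seed is an arbitrary choice for repeatability
S = set(range(1, 6))     # S = {1, ..., M} with M = 5
drawn = [draw_index(S, rng) for _ in range(5)]
```

After five draws the pool is empty and `drawn` is a permutation of the original indices.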

Upon selecting index $i$ , we set the corresponding element of the temporal mask ${\hat{z}_{i}}^{(k-1)}$ to 0 at line 20, i.e., $\hat{z}_{i}^{(k-1)}\leftarrow 0$ .

With the temporal mask, $\bm{\hat{z}}^{(k-1)}$ , we generate temporal feature vectors ${\bm{\hat{x}}_{i}^{(m)^{masked}}}$ , a feature vector set $\bm{\hat{X}}_{m}^{(k)}$ and a dataset $\mathcal{\hat{D}}_{m}^{(k)}$ from lines 21 to 24:

	$\displaystyle{\bm{\hat{x}}_{i}^{(m)^{masked}}}=\bm{x}_{i}^{(m)}\odot\bm{\hat{z}}^{(k-1)},$		(10)
	$\displaystyle\mathcal{\hat{D}}_{m}^{(k)}=\{({\bm{\hat{x}}_{i}^{(m)^{masked}}},y_{i}^{(m)})\}_{i=1}^{P},$		(11)
	$\displaystyle\bm{\hat{X}}_{m}^{(k)}=[\bm{\hat{x}}_{1}^{(m)^{masked}},\ldots,\bm{\hat{x}}_{P}^{(m)^{masked}}]^{T},$		(12)
	$\displaystyle\mathcal{L}_{mask}=\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{\hat{X}}_{m}^{(k)},\bm{\hat{\theta}}_{k})),$		(13)

and $\bm{\hat{X}}_{m}^{(k)}=\bm{X}_{m}^{(k-1)}\setminus\{i\}$ where $\bm{X}_{m}^{(k-1)}\setminus\{i\}$ denotes the feature set without feature $i$ . We define $\mathcal{L}_{mask}$ as $\mathcal{L}_{mask}=\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{X}_{m}^{(k-1)}\setminus\{i\},\bm{\hat{\theta}}_{k}))$ .

Using the temporal feature vectors generated with the updated temporal mask, the algorithm calculates a new loss $\mathcal{L}_{mask}$ in line 25. Moreover, a feature $i$ is said to be irrelevant if and only if $\frac{\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{X}_{m}^{(k-1)}\setminus\{i\},\bm{\hat{\theta}}_{k}))-\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{X}_{m}^{(k-1)},\bm{\hat{\theta}}_{k}))}{\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{X}_{m}^{(k-1)},\bm{\hat{\theta}}_{k}))}\leq\Delta\mathcal{L}$ or, equivalently, $\frac{\mathcal{L}_{mask}-\mathcal{L}_{th}}{\mathcal{L}_{th}}\leq\Delta\mathcal{L}$ , where $\bm{X}_{m}^{(k-1)}\setminus\{i\}$ denotes the feature set without feature $i$ and $\Delta\mathcal{L}$ denotes a predefined threshold, which is also initialized at line 1, adjusted by cross-validation, and only affects the sensitivity of determining a feature’s relevance. The zeroed mask entry is retained if the feature is irrelevant, i.e., if the loss remains unchanged or increases by no more than the predefined threshold $\Delta\mathcal{L}$ ; otherwise, the entry is restored to 1: ${\hat{z}}^{(k-1)}_{i}=\begin{cases}0,&\text{if }\frac{\mathcal{L}_{mask}-\mathcal{L}_{th}}{\mathcal{L}_{th}}\leq\Delta\mathcal{L}\\ 1,&\text{otherwise}.\end{cases}$
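The mask optimization loop (lines 15 to 33 of Algorithm 1), including the threshold update and the decrement of $\mu$ , can be sketched as below. Several choices here are assumptions made for illustration: a linear predictor with the MSE stands in for $F_{k}$ and $\mathcal{L}$ , indices are drawn only from currently active features, the loop also stops when no index is left, and the relative test is written in the multiplied-out form $\mathcal{L}_{mask}-\mathcal{L}_{th}\leq\Delta\mathcal{L}\,\mathcal{L}_{th}$ to avoid dividing by a zero threshold loss.

```python
import random


def masked_mse(X, y, theta, z):
    """MSE of a linear predictor on rows of X masked element-wise by z."""
    total = 0.0
    for row, t in zip(X, y):
        pred = sum(w * x * m for w, x, m in zip(theta, row, z))
        total += (t - pred) ** 2
    return total / len(y)


def optimize_mask(X_sel, y_sel, theta_hat, z_prev, mu, delta_l, rng):
    """Try to switch features off one at a time, keeping harmless removals."""
    z_hat = list(z_prev)                                   # temporal mask, eq. (9)
    loss_th = masked_mse(X_sel, y_sel, theta_hat, z_hat)   # threshold, eq. (8)
    available = [i for i, m in enumerate(z_hat) if m == 1]
    while mu != 0 and available:
        i = available.pop(rng.randrange(len(available)))   # unique random index
        z_hat[i] = 0                                       # tentatively drop i
        loss_mask = masked_mse(X_sel, y_sel, theta_hat, z_hat)
        if loss_mask - loss_th <= delta_l * loss_th:
            loss_th = loss_mask                            # removal is kept
        else:
            z_hat[i] = 1                                   # feature is needed
            mu -= 1
    return z_hat, loss_th


# Toy case: the trained model carries a small spurious weight on feature 1.
theta_hat = [3.0, 0.1, -2.0]
X_sel = [[1.0, 1.0, 1.0], [2.0, -1.0, 0.0], [0.0, 1.0, 2.0], [-1.0, -1.0, 1.0]]
y_sel = [1.0, 6.0, -4.0, -5.0]   # generated as 3 * x0 - 2 * x2
z_hat, loss_th = optimize_mask(X_sel, y_sel, theta_hat, [1, 1, 1],
                               mu=3, delta_l=0.05, rng=random.Random(0))
```

In this toy case, removing the second feature lowers the loss and is therefore kept, while removing either of the other two features increases the loss far beyond the tolerance and is reverted.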

If the loss does not exceed the threshold, we update $\mathcal{L}_{\text{th}}$ to $\mathcal{L}_{\text{mask}}$ . Otherwise, we decrement $\mu$ and move to the next randomly chosen index. Let $\delta\mathcal{L}$ denote the relative change in the loss function when feature $i$ is removed, i.e., when ${\hat{z}}^{(k-1)}_{i}$ is set to 0. Formally, $\delta\mathcal{L}=\frac{\mathcal{L}_{mask}-\mathcal{L}_{th}}{\mathcal{L}_{th}}$ . Given the threshold $\Delta\mathcal{L}$ , a feature $i$ is deemed irrelevant if $\delta\mathcal{L}\leq\Delta\mathcal{L}$ . This implies that the performance of the model does not degrade by more than $\Delta\mathcal{L}$ when feature $i$ is removed. Therefore, for any feature $i$ for which $\delta\mathcal{L}\leq\Delta\mathcal{L}$ , the binary mask entry is set to ${\hat{z}}^{(k-1)}_{i}=0$ , ensuring that only features that significantly contribute to the performance of the model are retained. This process is shown in lines 26 to 33. The mask optimization concludes when $\mu=0$ . Once the mask optimization phase concludes, at line 34, we employ the optimized mask to eliminate redundant features:

	$\displaystyle\textbf{DeleteColumns}(\bm{X}^{(k)},\bm{X}_{m}^{(k-1)},\bm{\hat{z}}^{(k-1)})=$		(14)
	$\displaystyle\bm{X}^{(k+1)},\bm{X}_{m}^{(k)},\bm{z}^{(k)}.$		(15)

The function, $\textbf{DeleteColumns}(\cdot,\cdot,\cdot)$ , removes feature columns from the datasets $\bm{X}^{(k)}$ , $\bm{X}_{m}^{(k-1)}$ and $\bm{z}^{(k-1)}$ based on the temporal binary mask $\bm{\hat{z}}^{(k-1)}$ . The mask $\bm{z}$ is stable if:

\bm{z}^{(k)}=\bm{z}^{(k-1)},

for a predefined number of consecutive iterations $\beta$ , which is again adjusted by cross-validation and only affects the computation time of the algorithm. Each time the mask remains unchanged, $\beta$ is decremented. The algorithm then either proceeds to the next, $(k+1)$ -th, model optimization or, once $\beta$ reaches zero, determines that it has reached an optimal set of features and terminates (lines 35 to 38).
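One possible reading of $\textbf{DeleteColumns}$ and of the stability check is sketched below. The function names are hypothetical, and the interpretation that the mask is “unchanged” exactly when no column was removed in the current iteration is an assumption of this example, not a statement from the paper.

```python
def delete_columns(X, X_sel, z_hat):
    """Drop the columns whose temporal mask entry is 0; return the new mask."""
    keep = [j for j, m in enumerate(z_hat) if m == 1]
    X_next = [[row[j] for j in keep] for row in X]
    X_sel_next = [[row[j] for j in keep] for row in X_sel]
    z_next = [1] * len(keep)   # all remaining features are active
    return X_next, X_sel_next, z_next


def update_patience(z_prev, z_next, beta):
    """Decrement beta when the mask did not change (lines 35 to 37)."""
    return beta - 1 if z_next == z_prev else beta


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
X_sel = [[7.0, 8.0, 9.0]]
X_next, X_sel_next, z_next = delete_columns(X, X_sel, z_hat=[1, 0, 1])

beta = 2
beta = update_patience([1, 1, 1], z_next, beta)   # a feature was removed
beta_after_stable = update_patience(z_next, z_next, beta)
```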

The complete description of the algorithm can be found in Algorithm 1.

Algorithm 1 Adaptive Feature Selection with Binary Masking (AFS-BM)

1: Initialize $\mu,\beta\in\mathbb{Z}^{+};\Delta\mathcal{L}\in\mathbb{R}^{+}$
2: $\mathcal{D}^{(0)}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{N},\mathcal{D}_{m}^{(0)}=\{(\bm{x}_{i}^{(m)},y_{i}^{(m)})\}_{i=1}^{P}$
3: $\bm{X}^{(0)}=[\bm{x}_{1},\ldots,\bm{x}_{N}]^{T},\bm{X}_{m}^{(0)}=[\bm{x}_{1}^{(m)},\ldots,\bm{x}_{P}^{(m)}]^{T}$
4: $\bm{z}^{(0)}\in\{0,1\}^{M}$
5: while stop criterion $\beta\neq 0$
6:   Initialize $F_{k}(\bm{X}^{(k)},\bm{\theta})$
7:   $\bm{x}_{i}^{masked}=\bm{x}_{i}\odot\bm{z}^{(k-1)}$
8:   $\mathcal{D}^{(k)}=\{(\bm{x}_{i}^{masked},y_{i})\}_{i=1}^{N}$
9:   $\bm{X}^{(k)}=[\bm{x}^{masked}_{1},\ldots,\bm{x}^{masked}_{N}]^{T}$
10:   $\bm{\hat{\theta}}_{k}=\operatorname*{arg\,min}_{\bm{\theta}}\mathcal{L}(\bm{y},F_{k}(\bm{X}^{(k)},\bm{\theta}))$
11:   ${\bm{x}_{i}^{(m)}}^{masked}=\bm{x}_{i}^{(m)}\odot\bm{z}^{(k-1)}$
12:   $\mathcal{D}_{m}^{(k-1)}=\{({\bm{x}_{i}^{(m)}}^{masked},y_{i}^{(m)})\}_{i=1}^{P}$
13:   $\bm{X}_{m}^{(k-1)}=[{\bm{x}_{1}^{(m)}}^{masked},\ldots,{\bm{x}_{P}^{(m)}}^{masked}]^{T}$
14:   $\mathcal{L}_{th}=\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{X}_{m}^{(k-1)},\bm{\hat{\theta}}_{k}))$
15:   Define $\bm{\hat{z}}^{(k-1)}\leftarrow\bm{z}^{(k-1)}$
16:   Initialize set of available indices: $\mathcal{S}$
17:   while stop criterion $\mu\neq 0$
18:     Randomly select $i$ from $\mathcal{S}$
19:     Remove $i$ from $\mathcal{S}$
20:     $\hat{z}_{i}^{(k-1)}=0$
21:     ${\bm{\hat{x}}_{i}^{(m)^{masked}}}=\bm{x}_{i}^{(m)}\odot\bm{\hat{z}}^{(k-1)}$
22:     $\mathcal{\hat{D}}_{m}^{(k)}=\{({\bm{\hat{x}}_{i}^{(m)^{masked}}},y_{i}^{(m)})\}_{i=1}^{P}$
23:     $\bm{\hat{X}}_{m}^{(k)}=[\bm{\hat{x}}_{1}^{(m)^{masked}},\ldots,\bm{\hat{x}}_{P}^{(m)^{masked}}]^{T}$
24:     $\bm{X}_{m}^{(k-1)}\setminus\{i\}=\bm{\hat{X}}_{m}^{(k)}$
25:     $\mathcal{L}_{mask}=\mathcal{L}(\bm{y}^{(m)},F_{k}(\bm{X}_{m}^{(k-1)}\setminus\{i\},\bm{\hat{\theta}}_{k}))$
26:     if $\frac{\mathcal{L}_{mask}-\mathcal{L}_{best}}{\mathcal{L}_{best}}\leq\Delta\mathcal{L}$ then
27:       $\hat{z}_{i}^{(k-1)}\leftarrow 0$
28:       $\mathcal{L}_{th}\leftarrow\mathcal{L}_{mask}$
29:     else
30:       $\hat{z}_{i}^{(k-1)}\leftarrow 1$
31:       $\mu\leftarrow\mu-1$
32:     end if
33:   end while
34:   $\textbf{DeleteColumns}(\bm{X}^{(k)},\bm{X}_{m}^{(k-1)},\bm{\hat{z}}^{(k-1)})=\bm{X}^{(k+1)},\bm{X}_{m}^{(k)},\bm{z}^{(k)}$
35:   if $\bm{z}^{(k-1)}=\bm{z}^{(k)}$ then
36:     $\beta\leftarrow\beta-1$
37:   end if
38: end while
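To make the control flow of Algorithm 1 concrete, the following is a minimal NumPy sketch of the two nested loops. It is not the authors' implementation, and several simplifications are assumptions. An ordinary least-squares model stands in for the GBM or MLP $F_{k}$. The reference loss in the line-26 test is taken to be the running threshold $\mathcal{L}_{th}$. The inner patience counter is re-initialized to $\mu$ in every outer iteration, and the inner loop also stops when no candidate indices remain. The function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in for training F_k: least squares with an intercept term."""
    A = np.column_stack([X, np.ones(len(X))])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def predict_linear(X, theta):
    return np.column_stack([X, np.ones(len(X))]) @ theta

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def afs_bm_sketch(X, y, X_m, y_m, mu=3, beta=2, delta_loss=0.01, seed=0):
    """Return the original indices of the columns that survive the masking."""
    rng = np.random.default_rng(seed)
    kept = np.arange(X.shape[1])
    while beta > 0 and kept.size > 0:                                # line 5
        theta = fit_linear(X[:, kept], y)                            # line 10
        loss_ref = mse(y_m, predict_linear(X_m[:, kept], theta))     # line 14
        z_hat = np.ones(kept.size, dtype=bool)                       # line 15
        available = list(range(kept.size))                           # line 16
        patience = mu                                                # assumption: reset each round
        while patience > 0 and available:                            # line 17
            i = available.pop(int(rng.integers(len(available))))     # lines 18-19
            z_hat[i] = False                                         # line 20
            X_masked = X_m[:, kept] * z_hat                          # lines 21-24
            loss_mask = mse(y_m, predict_linear(X_masked, theta))    # line 25
            if (loss_mask - loss_ref) / max(loss_ref, 1e-12) <= delta_loss:  # line 26
                loss_ref = loss_mask                                 # line 28
            else:
                z_hat[i] = True                                      # line 30
                patience -= 1                                        # line 31
        if z_hat.all():                                              # lines 35-36
            beta -= 1
        kept = kept[z_hat]                                           # line 34
    return kept

# Usage on synthetic data: only columns 0 and 1 carry signal.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(400, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * data_rng.normal(size=400)
X_m = data_rng.normal(size=(200, 6))
y_m = 3 * X_m[:, 0] - 2 * X_m[:, 1] + 0.1 * data_rng.normal(size=200)
selected = afs_bm_sketch(X, y, X_m, y_m)
```

Masking is done by multiplying a column by zero, mirroring the element-wise product with the binary mask, and columns are physically removed only at the end of each outer iteration, as in line 34.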

5 Simulations

In this study, we compare the AFS-BM algorithm with widely used feature selection approaches on well-established real-life competition datasets. The efficacy of AFS-BM is demonstrated on regression tasks, with Gradient Boosting Machines (GBMs) and Neural Networks (NNs) serving as the underlying models. Note that our algorithm is generic and can be applied to other machine learning algorithms; however, we concentrate on these two since they are widely used in the literature.

5.1 Implementation and Evaluation of AFS-BM

Our experimental framework covers regression tasks on both synthetic and well-known competition datasets. For the regression analyses, we first evaluate the performance of the AFS-BM algorithm against the other feature selection algorithms on the synthetic dataset. Starting with a synthetic dataset provides a controlled environment in which the ground truth is known, which allows us to thoroughly evaluate the algorithm’s behavior and effectiveness. This initial step provides valuable insights into the algorithm’s capabilities before applying it to real-world datasets. Then, we continue with the real-life datasets, namely the M4 Forecasting Competition Dataset [22] and the Istanbul Stock Exchange Hourly Dataset [23]. A crucial aspect of our study is performance benchmarking, where we compare the results of our AFS-BM feature selection algorithm against other prevalent methods in the field. Using a validation set, we optimize all hyperparameters for LightGBM, XGBoost, MLP, and the feature selection methods to remove any spurious or coincidental correlations that might arise from overfitting. The hyperparameter search space is also cross-validated.

5.1.1 Regression Experiments and Results

Our regression experiments use the daily series of the M4 Forecasting Competition Dataset [22] and the Istanbul Stock Exchange Hourly Dataset [23]. We will first configure the AFS-BM algorithm for each dataset and then compare its performance against traditional feature selection methods. The experiments will also explore the impact of hyperparameter tuning on AFS-BM’s accuracy, concluding with insights into its overall effectiveness.

Time Series Feature Engineering and Algorithmic Evaluation on the Daily M4 Forecasting Dataset

To predict the target value $y_{t}$ , we utilize the daily series from the M4 Forecasting Competition Dataset, which initially contains only the target values $y_{t}$ . For feature engineering, we consider the first three lags of $y_{t}$ as $y_{t-1}$ , $y_{t-2}$ , and $y_{t-3}$ , and the 7th, 14th, and 28th lags as $y_{t-7}$ , $y_{t-14}$ , and $y_{t-28}$ respectively. Beyond these basic lag features, we compute rolling statistics, specifically, the rolling mean and rolling standard deviation with window sizes of 4, 7, and 28 until $y_{t}$ .
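The lag and rolling-window features described above can be built as in the following NumPy sketch. Two details are assumptions rather than statements from the text: the rolling statistics are computed over the `window` samples strictly before $y_{t}$ (so the target itself is never used as an input), and the standard deviation uses NumPy's default population form. The function and column names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lag(y, k):
    """out[t] = y[t-k]; NaN where the lagged value does not exist."""
    out = np.full(len(y), np.nan)
    out[k:] = y[:-k]
    return out

def rolling_stat(y, window, fn):
    """Apply fn to y[t-window], ..., y[t-1]; NaN for the first `window` steps."""
    out = np.full(len(y), np.nan)
    windows = sliding_window_view(y, window)   # windows[j] == y[j : j + window]
    out[window:] = fn(windows[:-1], axis=1)
    return out

def m4_daily_features(y):
    """Build the 12 features (6 lags, 3 rolling means, 3 rolling stds)."""
    y = np.asarray(y, dtype=float)
    cols = {f"lag_{k}": lag(y, k) for k in (1, 2, 3, 7, 14, 28)}
    for w in (4, 7, 28):
        cols[f"roll_mean_{w}"] = rolling_stat(y, w, np.mean)
        cols[f"roll_std_{w}"] = rolling_stat(y, w, np.std)
    X = np.column_stack(list(cols.values()))
    valid = ~np.isnan(X).any(axis=1)           # drop the warm-up rows
    return X[valid], y[valid], list(cols)

X, target, names = m4_daily_features(np.arange(40.0))
```

Rows for which any feature is undefined (here the first 28 time steps) are dropped, so every remaining row has a complete feature vector.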

To remove any bias toward a particular time series, we select 100 random series from the dataset and normalize them to $[-1,1]$ after mean removal. Since the data length is not uniform across series, we reserve the last $10\%$ of the total samples of each series for testing, the $20\%$ of total samples before the test set for mask validation, and the $20\%$ of total samples before the mask validation set for model validation. To underscore the significance of feature selection, we utilize the default configurations of LightGBM, XGBoost, and Multi-Layer Perceptron (MLP) without incorporating any specific feature selection techniques. We compare our algorithm against the well-known feature selection methods Cross-Correlation, Mutual Information, and RFE. Using a validation set, we optimize all hyperparameters for LightGBM, XGBoost, MLP, and the feature selection methods. Both our introduced AFS-BM algorithm and the other feature selection methods employ LightGBM and XGBoost as their primary boosting techniques.
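The chronological partition described above can be expressed as four contiguous index ranges. The sketch below assumes that the remaining samples (roughly the first half of each series) form the training set and that fractional sizes are rounded to the nearest integer; neither detail is stated in the text, and the function name is hypothetical.

```python
def chronological_split(n_samples, test=0.10, mask_val=0.20, model_val=0.20):
    """Return slices (train, model_val, mask_val, test) in time order."""
    n_test = int(round(n_samples * test))
    n_mask = int(round(n_samples * mask_val))
    n_model = int(round(n_samples * model_val))
    n_train = n_samples - n_test - n_mask - n_model   # assumption: the rest is training data
    i1 = n_train
    i2 = i1 + n_model
    i3 = i2 + n_mask
    return slice(0, i1), slice(i1, i2), slice(i2, i3), slice(i3, n_samples)

train, model_val, mask_val, test = chronological_split(1000)
```

Because the slices are ordered in time, no sample used for model validation, mask validation, or testing precedes a sample used for training.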

For each randomly selected time series, denoted as $y_{t}^{(s)}$, where $s=1,\ldots,100$, we apply all the algorithms for feature selection and then compute the test loss. The test losses are given by $l_{t}^{(s)}\triangleq(y_{t}^{(s)}-\hat{y}^{(s)}_{t})^{2}$, where $t$ ranges from 1 up to $t_{max}$, the latter being the longest time duration among all $y_{t}^{(s)}$ for $s=1,\ldots,100$. For series with shorter durations, we pad the loss sequences with zeros at the end to ensure consistency. To eliminate any bias towards a specific sequence, we compute the average of these loss sequences, yielding the final averaged squared error: $l_{t}^{(ave)}=\frac{1}{100}\sum_{s=1}^{100}l_{t}^{(s)}$. For further refinement, we also average over time, resulting in $l_{t}^{(ave2)}=\frac{1}{t}\sum_{j=1}^{t}l_{j}^{(ave)}$. The outcomes for the LightGBM-based algorithms are depicted in Fig. 1, while those for the XGBoost-based algorithms are illustrated in Fig. 2.
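The two averaging steps above can be written compactly with NumPy. The sketch below follows the stated definitions: per-series squared errors are zero-padded to the longest length, averaged across series, and then averaged cumulatively over time. The function name is hypothetical, and the number of series is taken from the input rather than fixed at 100.

```python
import numpy as np

def averaged_losses(y_true_list, y_pred_list):
    """Return (l_ave, l_ave2) for a list of series of possibly different lengths."""
    t_max = max(len(y) for y in y_true_list)
    losses = np.zeros((len(y_true_list), t_max))       # zero padding at the end
    for s, (y, y_hat) in enumerate(zip(y_true_list, y_pred_list)):
        err = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
        losses[s, : len(err)] = err ** 2                # l_t^(s)
    l_ave = losses.mean(axis=0)                         # average over series
    l_ave2 = np.cumsum(l_ave) / np.arange(1, t_max + 1) # running average over time
    return l_ave, l_ave2

# Two toy series: squared errors are [1, 0, 4] and [4] (padded to [4, 0, 0]).
l_ave, l_ave2 = averaged_losses([[1, 2, 3], [1]], [[0, 2, 1], [3]])
# l_ave  -> [2.5, 0.0, 2.0]
# l_ave2 -> [2.5, 1.25, 1.5]
```

Note that zero padding lowers $l_{t}^{(ave)}$ at time steps where only a few series are still active, since the divisor stays equal to the total number of series.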

Figure 1: Comparison of averaged loss over time, $l_{t}^{(ave2)}$, for the experiments on M4 Competition data with LightGBM-based algorithms.

Additionally, Table 1 offers a detailed comparison of our algorithm against the standard LightGBM, XGBoost, and their respective implementations with Cross-Correlation, Mutual Information, and RFE algorithms.

Algorithm \ Base Model	LightGBM	XGBoost
Cross-Correlation Algorithm	$4.7895\times 10^{-2}$	$9.1995\times 10^{-2}$
Mutual Information Algorithm	$4.6626\times 10^{-2}$	$9.0114\times 10^{-2}$
RFE Algorithm	$4.5212\times 10^{-2}$	$8.8511\times 10^{-2}$
Vanilla Algorithm	$4.7691\times 10^{-2}$	$8.9000\times 10^{-2}$
AFS-BM Algorithm	$\bm{3.8251\times 10^{-2}}$	$\bm{5.4094\times 10^{-2}}$

Table 1: Final MSE Results of the Experiments on M4 Competition Daily Dataset with GBM-based Learners

Moreover, we apply the MLP model and other algorithms to the test data for each series, resulting in the loss sequences. The performance trends of the algorithms over time are visualized in Figure 3, which provides a graphical representation of the averaged loss sequences for the experiments on the M4 Competition data with MLP-based algorithms.

Additionally, Table 2 offers a detailed comparison of our algorithm against the standard MLP and its implementations with the Cross-Correlation and Mutual Information algorithms.

Algorithm \ Base Model	MLP
Cross-Correlation Algorithm	$4.8052\times 10^{-2}$
Mutual Information Algorithm	$4.6781\times 10^{-2}$
Vanilla Algorithm	$4.7845\times 10^{-2}$
AFS-BM Algorithm	$\bm{3.8385\times 10^{-2}}$

Table 2: Final MSE Results of the Experiments on M4 Competition Daily Dataset with MLP

Time Series Feature Engineering and Algorithmic Evaluation on the Istanbul Stock Exchange Dataset

To forecast the target value $y_{t}$, we turn to the Istanbul Stock Exchange Dataset [23], which provides hourly data points. Initially, this dataset offers only the target values, denoted as $y_{t}$. To enrich the predictive capability of our model, we perform a systematic feature engineering process. First, we extract immediate temporal dependencies by considering the three most recent lags of $y_{t}$, represented as $y_{t-1}$, $y_{t-2}$, and $y_{t-3}$. Recognizing the significance of longer-term patterns in hourly data, especially over the course of a day or multiple days, we also incorporate the 24th, 36th, and 48th lags, denoted as $y_{t-24}$, $y_{t-36}$, and $y_{t-48}$, respectively. In addition to these lags, we also factor in the cyclical nature of time by taking the sine and cosine values of the timestamp components, namely the month, day, and hour. This helps in capturing the cyclic patterns associated with daily and monthly rhythms.

In addition to these features, we also calculate the rolling statistics to capture more nuanced patterns. Specifically, we calculate the rolling mean and rolling standard deviation for both recent ( $y_{t-1}$ ) and daily ( $y_{t-24}$ ) lags. These calculations are performed over window sizes of 4 (representing a shorter part of the day), 12 (half a day), and 24 (a full day) hours. Hence, we get a comprehensive set of features from the time series data.
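The sine and cosine encoding of timestamp components mentioned above can be sketched as follows. The periods used here (24 for the hour, 12 for the month) are assumptions for illustration, since the text does not state them, and the function name is hypothetical.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map a periodic quantity to a point on the unit circle (sin, cos)."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours = np.array([0, 6, 12, 18, 23])
hour_sin, hour_cos = cyclical_encode(hours, period=24)     # assumed period of 24 hours

months = np.array([1, 4, 7, 10, 12])
month_sin, month_cos = cyclical_encode(months, period=12)  # assumed period of 12 months
```

Using both components places hour 23 next to hour 0 in feature space, which a single integer-valued hour feature would not do.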

Following mean removal, the dataset is normalized to $[-1,1]$. Of the total samples, $10\%$ are designated for testing. Furthermore, $20\%$ of the samples leading up to the test set are reserved for mask validation, while an additional $20\%$ preceding the mask validation set is allocated for model validation. LightGBM, XGBoost, and MLP are utilized in their standard configurations, bypassing specialized feature selection methods. Our algorithm is compared against the same feature selection methods as in the previous experiments. All hyperparameters associated with LightGBM, XGBoost, and MLP undergo optimization based on the model validation set to ensure optimal performance.
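One possible reading of "mean removal followed by normalization to $[-1,1]$" is to center the series and then divide by its largest absolute value. The sketch below implements that reading only; the exact scaling rule and whether the statistics are computed on the full series or on the training portion are not specified in the text, so both are assumptions, and the function name is hypothetical.

```python
import numpy as np

def normalize_series(y, fit_slice=slice(None)):
    """Center y and scale it by the maximum absolute centered value.

    fit_slice selects the samples used to estimate the mean and the scale;
    the default uses the whole series (an assumption).
    """
    y = np.asarray(y, dtype=float)
    centered = y - y[fit_slice].mean()
    scale = np.abs(centered[fit_slice]).max()
    return centered / scale if scale > 0 else centered

y_norm = normalize_series([1.0, 2.0, 3.0, 6.0])
# mean is 3, centered values are [-2, -1, 0, 3], so y_norm is [-2/3, -1/3, 0, 1]
```

Passing the training slice as `fit_slice` would avoid using validation or test samples when estimating the statistics, at the cost of values outside $[-1,1]$ in the held-out portions.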

The findings based on Mean Squared Error (MSE) loss and the number of selected features for both LightGBM and XGBoost-based algorithms are displayed in Table 3 and Table 4.

Algorithm \ Base Model	LightGBM	XGBoost
Cross-Correlation Algorithm	$2.5564\times 10^{-2}$	$2.5876\times 10^{-2}$
Mutual Information Algorithm	$2.8480\times 10^{-2}$	$2.5536\times 10^{-2}$
RFE Algorithm	$4.5903\times 10^{-3}$	$5.8105\times 10^{-3}$
Vanilla Algorithm	$4.3225\times 10^{-3}$	$5.5866\times 10^{-3}$
AFS-BM Algorithm	$\bm{3.8350\times 10^{-3}}$	$\bm{3.9445\times 10^{-3}}$

Table 3: Final MSE Results of the Experiments on Istanbul Stock Exchange Hourly Dataset with GBM-based Learners

Base Model	Method	Selected Features
LightGBM	AFS-BM	2
XGBoost	AFS-BM	8
LightGBM	Cross-Correlation	66
XGBoost	Cross-Correlation	66
LightGBM	Mutual Information	97
XGBoost	Mutual Information	101
LightGBM	RFE	10
XGBoost	RFE	5

Table 4: The number of features selected from a total of 176 features by each feature selection method on the Istanbul Stock Exchange Hourly Dataset with GBM-based Learners

Furthermore, the dataset undergoes evaluation using the MLP model. MLP, with its layered architecture, is adept at capturing nonlinear relationships in the data, making it a suitable choice for such complex datasets. The results based on Mean Squared Error (MSE) loss for the MLP-based algorithms are presented in Table 5.

Algorithm \ Base Model	MLP
Cross-Correlation Algorithm	$2.5652\times 10^{-2}$
Mutual Information Algorithm	$2.8603\times 10^{-2}$
Vanilla Algorithm	$4.3350\times 10^{-3}$
AFS-BM Algorithm	$\bm{3.8423\times 10^{-3}}$

Table 5: Final MSE Results of the Experiments on Istanbul Stock Exchange Hourly Dataset with MLP

A comparative analysis reveals that the AFS-BM algorithm consistently surpasses the other methods across all regression experiments. The main strength of AFS-BM is its ability to adapt its feature selection as data patterns change, which it does by checking the error at every step. This adaptability distinguishes it from fixed methods such as Cross-Correlation and Mutual Information. AFS-BM can effectively handle both short-term and long-term changes in time series data. When combined with cross-validated hyperparameter selection, the algorithm works at its best. In summary, our experimental results demonstrate the superior performance of the AFS-BM algorithm over the state-of-the-art feature selection methods.

6 Conclusion

In this paper, we have addressed the problem of feature selection in general machine learning models, recognizing its critical role in both model accuracy and efficiency. Traditional feature selection methods struggle with significant challenges such as scalability, managing high-dimensional data, handling correlated features, adapting to shifting feature importance, and integrating domain knowledge. To this end, we introduced the “Adaptive Feature Selection with Binary Masking” (AFS-BM) algorithm.

AFS-BM differs from the current algorithms since it uses joint optimization for both dynamic and stochastic feature selection and model training, a method that involves binary masking. This approach enables the algorithm to adapt continuously by adjusting features and model parameters during training, responding to changes in feature importance measured by the loss metric values. This adaptability ensures that AFS-BM retains essential features while discarding less relevant ones based on evaluation results, rather than relying solely on feature importance calculated using intuitive reasoning. This is crucial since depending solely on feature importance can lead to the removal of informative features, as it evaluates them within the context of the feature set, rather than considering their contribution to final model performance. AFS-BM’s approach prevents the removal of informative features, ultimately enhancing model accuracy. To encourage further research and allow others to replicate our results, we openly share our source code at https://github.com/YigitTurali/AFS_BM-Algorithm.

Statements and Declarations

• Competing interests: The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

• Availability of data and materials: The data that support the findings of this study are openly available in the UCI Machine Learning Repository at https://archive.ics.uci.edu.

References

Capobianco [2022] Capobianco, E. High-dimensional role of AI and machine learning in cancer research. British Journal of Cancer 2022, 126, 523–532.
Bellman [1961] Bellman, R. E. Adaptive Control Processes: A Guided Tour; Princeton University Press: Princeton, 1961.
Bishop [2006] Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer-Verlag: Berlin, Heidelberg, 2006.
Farmanbar and Toygar [2016] Farmanbar, M.; Toygar, Ö. Feature selection for the fusion of face and palmprint biometrics. Signal, Image and Video Processing 2016, 10, 951–958.
Guyon and Elisseeff [2003] Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. 2003; p 1157–1182.
Lajevardi and Hussain [2012] Lajevardi, S. M.; Hussain, Z. M. Automatic facial expression recognition: feature extraction and selection. Signal, Image and Video Processing 2012, 6, 159–169.
Peng et al. [2005] Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005, 27, 1226–1238.
Kira and Rendell [1992] Kira, K.; Rendell, L. A. In Machine Learning Proceedings 1992; Sleeman, D., Edwards, P., Eds.; Morgan Kaufmann: San Francisco (CA), 1992; pp 249–256.
Rida et al. [2016] Rida, I.; Almaadeed, S.; Bouridane, A. Gait recognition based on modified phase-only correlation. Signal, Image and Video Processing 2016, 10, 463–470.
Tibshirani [1996] Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996, 58, 267–288.
Breiman [2001] Breiman, L. Random Forests. Machine Learning 2001, 45, 5–32.
Saeys et al. [2007] Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23, 2507–2517.
Atan et al. [2019] Atan, O.; Zame, W. R.; Feng, Q.; van der Schaar, M. Constructing effective personalized policies using counterfactual inference from biased data sets with many features. Machine Learning 2019, 108, 945–970.
Subasi [2015] Subasi, A. A decision support system for diagnosis of neuromuscular disorders using DWT and evolutionary support vector machines. Signal, Image and Video Processing 2015, 9, 399–408.
Ruszczak et al. [2024] Ruszczak, B.; Smykała, K.; Tomaszewski, M.; Navarro Lorente, P. J. Various tomato infection discrimination using spectroscopy. Signal, Image and Video Processing 2024.
Chen and Guestrin [2016] Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016; pp 785–794.
Goodfellow et al. [2016] Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press, 2016; http://www.deeplearningbook.org.
Junejo et al. [2013] Junejo, I. N.; Bhutta, A. A.; Foroosh, H. Single-class SVM for dynamic scene modeling. Signal, Image and Video Processing 2013, 7, 45–52.
Aguiar et al. [2023] Aguiar, G.; Krawczyk, B.; Cano, A. A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework. Machine Learning 2023.
Zhang et al. [2022] Zhang, N.; Chen, M.; Yang, F.; Yang, C.; Yang, P.; Gao, Y.; Shang, Y.; Peng, D. Forest Height Mapping Using Feature Selection and Machine Learning by Integrating Multi-Source Satellite Data in Baoding City, North China. Remote Sensing 2022, 14.
Sun and Dai [2018] Sun, C.; Dai, R. Distributed Optimization for Convex Mixed-Integer Programs based on Projected Subgradient Algorithm. 2018 IEEE Conference on Decision and Control (CDC). 2018; pp 2581–2586.
Makridakis et al. [2020] Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 2020, 36, 54–74, M4 Competition.
Akbilgic [2013] Akbilgic, O. ISTANBUL STOCK EXCHANGE. UCI Machine Learning Repository, 2013; DOI: https://doi.org/10.24432/C54P4J.