Histogram Layers for Neural “Engineered” Features

Joshua Peeples, Salim Al Kharsa, Luke Saleh, and Alina Zare

J. Peeples is an Assistant Professor and S. Al Kharsa is an undergraduate student in the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA, e-mails: [email protected] and [email protected]. L. Saleh is an undergraduate student and A. Zare is a Professor in the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, 32608, USA, e-mails: [email protected] and [email protected].

Abstract

In the computer vision literature, many effective histogram-based features have been developed. These “engineered” features include local binary patterns and edge histogram descriptors, among others, and they have been shown to be informative for a variety of computer vision tasks. In this paper, we explore whether these features can be learned through histogram layers embedded in a neural network and, therefore, be leveraged within deep learning frameworks. By using histogram features, local statistics of the feature maps from convolutional neural networks can be used to better represent the data. We present neural versions of local binary pattern and edge histogram descriptors that jointly improve the feature representation and perform image classification. Experiments are presented on benchmark and real-world datasets. Our code is publicly available at https://github.com/Advanced-Vision-and-Learning-Lab/NEHD_NLBP.

Index Terms:

Deep Learning, Histograms, Feature Learning, Feature Engineering.

I Introduction

Before deep learning became popular, feature engineering played (and often still plays) a vital role in the fields of computer vision and machine learning. Examples of engineered features include local binary patterns [ojala1994performance] (LBP) and edge histogram descriptors (EHD) [manjunath2002introduction]. Extracting and selecting the best features (i.e., feature engineering) was difficult overall. Many of these engineered features were created through the use of histograms, but there were many parameters (such as bin centers and widths) that were difficult to tune. As an alternative to the difficult and time-consuming process of feature engineering, deep learning is used to automate the process of extracting features and performing follow-on tasks (e.g., classification, segmentation, object detection).

Convolutional neural networks (CNNs) are frequently used in a variety of applications. These models extract features through a series of convolutions, aggregation functions (e.g., max and average pooling) and non-linear activation functions to improve the representation of the data for better performance. Despite the powerful, expressive features learned by a CNN, these models have some downfalls. CNNs have a spatial stationarity assumption; therefore, CNNs are unable to account for changes in the local statistics (i.e., statistical texture features [peeples2022histogram]) for regions of the data [taigman2014deepface, liang2021patch]. Additionally, CNNs can be costly in terms of training time, memory requirements, and the amount of data required [juefei2017local]. CNNs effectively capture structural texture features while histogram layers focus on statistical texture features [peeples2022histogram].

To mitigate these issues associated with traditional and deep learning features, alternative models have been introduced that take inspiration from both approaches. A notable method is the local binary convolutional neural network (LBCNN) [juefei2017local]. The LBCNN used a novel architecture design that led to a less expensive model that performed comparably to standard CNNs. LBCNN was inspired by LBP; however, the LBCNN generalizes the code generation and does not account for the aggregation operation of the original LBP.

Figure 1: Feature maps captured by LBP (top) and NLBP (bottom) from the FashionMNIST dataset. The proposed approach nearly reconstructs the original LBP feature. The LBP and NLBP encodings differ slightly due to the non-linearity: threshold (LBP) and ReLU (NLBP). Additionally, a square window ($3\times 3$) is used for NLBP while LBP used a radial window for the local neighborhood. There are small magnitude differences in the histograms because of the soft binning approximation used in NLBP.

Histogram-based feature approaches (e.g., LBP and EHD) as well as CNNs are able to extract texture information [maenpaa2005texture, liu2019bow, hermann2020origins, peeples2021histogram] to represent the data. There are two types of textures: statistical and structural [peeples2021histogram, ji2022structural]. Both texture feature types are useful to represent the data to improve performance and features such as LBP use both within the feature extraction pipeline [maenpaa2005texture]. In this work, we present a generalized framework to learn histogram-based features in artificial neural networks: neural LBP (NLBP) and EHD (NEHD). The contributions of this new method are four-fold:

• Flexibility: mitigation of issues associated with parameter selection

• Expressibility: ability to change the feature representation to best match the problem (the original LBP/EHD may not be best for the problem)

• Synergy: fusion of statistical and structural textures in network design for neural “engineered” features

• Utility: the proposed layers can be used to learn histogram-based features and, in future work, other powerful histogram-based features could be discovered through training this layer.

Each neural feature can be reconstructed using convolutional and histogram layers to learn the structural and statistical texture features respectively. Our proposed neural “engineered” features can closely reconstruct the LBP and EHD features as shown in Figures 1 and 2. The LBP feature consists of both structural (i.e., encoded pixel differences) and statistical (i.e., binary code frequency) texture information. To capture the structural relationships, following [liu2017local], sparse convolutional kernels can be used to capture the difference between the center and neighboring pixels in an image. After the convolution operation is performed, an activation function (e.g., sigmoid, ReLU) can be used instead of the threshold operation. After the input image has been encoded, a histogram layer can be used to adaptively aggregate the encoded features as opposed to the fixed binning of the original LBP feature.

Figure 1 shows a comparison of LBP and the proposed NLBP. To reconstruct the LBP feature, a modified ReLU activation function can be used. The standard ReLU function sets all negative values to $0$ (similar to the threshold function in LBP), but if the center pixel and neighbor pixel value are the same, the LBP threshold function will output a value of $1$ . To account for this in the NLBP, if the difference between a center and neighboring pixel is $0$ , the ReLU function was set to output a $1$ instead of a $0$ resulting in a similar encoding map as the original LBP. The histogram layer bins are initialized to the corresponding LBP code values ranging from $0$ to $255$ . The bin width was set to $3.75$ as this value will correspond to a narrow histogram bin.
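The modified activation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `lbp_threshold` and `modified_relu` are hypothetical, and the pixel differences are assumed to come from inputs scaled to $[0,1]$.

```python
def lbp_threshold(diff):
    """Threshold from the original LBP: 1 when the difference is non-negative."""
    return 1 if diff >= 0 else 0


def modified_relu(diff):
    """ReLU that outputs 1 (instead of 0) when the difference is exactly 0."""
    if diff == 0:
        return 1.0
    return max(diff, 0.0)


# Differences between neighbor and center pixels (assumed inputs in [0, 1]).
diffs = [-0.5, -0.1, 0.0, 0.2, 1.0]
hard = [lbp_threshold(d) for d in diffs]
soft = [modified_relu(d) for d in diffs]
# Both encodings are zero for negative differences and non-zero otherwise.
same_support = [(h > 0) == (s > 0) for h, s in zip(hard, soft)]
```

The two encodings agree on which neighbors are "on"; they differ only in magnitude for strictly positive differences, which is one source of the small differences between the LBP and NLBP maps.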

The EHD feature also captures structural (i.e., edges) and statistical (i.e., max response frequency) texture information. The structural features can be captured using a convolutional layer with each filter corresponding to a Sobel kernel for each edge orientation, $0^{\circ}$ to $315^{\circ}$. A total of eight edge kernels were used to capture horizontal, vertical, diagonal and anti-diagonal orientations (as in the standard EHD feature). To account for direction information, the bin center values are set to the maximum response of the convolution between the edge kernels and the input data (i.e., maximum input value $\times$ ($1+2+1$)). By changing the bin center, the binning function will now account for direction information. The opposite direction will now have the lowest response for a given orientation (e.g., the $0^{\circ}$ kernel will have the lowest response for a $180^{\circ}$ edge). The same bin width value ($3.75$) was used for NEHD as in NLBP. The parameters (convolution kernels, bin centers, and bin widths) were fixed to ensure that the histogram layer can mimic the EHD feature as shown in Figure 2.
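The bin center arithmetic above (maximum input value $\times$ ($1+2+1$)) can be checked with a short sketch. It assumes that the eight orientations are obtained by rotating the outer ring of a $3\times 3$ Sobel kernel in $45^{\circ}$ steps; the helper names `rotate_kernel` and `max_response` are hypothetical, and the assignment of angles to kernels is an assumption.

```python
# Clockwise positions of the outer ring of a 3x3 kernel, starting top-left.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
SOBEL = [[-1, 0, 1],
         [-2, 0, 2],
         [-1, 0, 1]]


def rotate_kernel(kernel, steps):
    """Rotate the outer ring of a 3x3 kernel by `steps` positions (45 degrees each)."""
    values = [kernel[r][c] for r, c in RING]
    rotated = [[0, 0, 0], [0, kernel[1][1], 0], [0, 0, 0]]
    for idx, (r, c) in enumerate(RING):
        rotated[r][c] = values[(idx - steps) % 8]
    return rotated


def max_response(kernel, max_input):
    """Largest possible response for inputs in [0, max_input]."""
    return max_input * sum(w for row in kernel for w in row if w > 0)


kernels = [rotate_kernel(SOBEL, s) for s in range(8)]
bin_centers = [max_response(k, max_input=1.0) for k in kernels]
```

Under these assumptions every orientation has the same maximum response (4 times the maximum input value), and the kernel rotated by four steps is the negation of the original, which is why opposite directions give the lowest response.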

The spatial locations of the responses align properly as shown in Figure 2. The “no edge” responses of the NEHD and EHD feature maps are nearly the same. Since NEHD is a soft approximation of EHD, there will be some overlap between each edge orientation. As a result, there are small magnitude differences between each edge orientation. The bin width can be tuned to produce a more exact representation for each edge response. The EHD feature can be reconstructed with the histogram layer if 1) the weights on the input features represent edge kernels and 2) the bin centers are equal to the maximum response of each orientation.

II Related Work

II-A “Engineered” Histogram Features

Histograms are used throughout computer vision and machine learning as a method to aggregate intensity and/or feature values as well as relationships between neighboring inputs (e.g., edge orientation, pixel differences). Common histogram-based features include LBP [ojala1994performance], histogram of oriented gradients (HOG) [dalal2005histograms], EHD [frigui2008detection], and gray level co-occurrence matrix (GLCM) [haralick1973textural]. For this work, we will focus on LBP and EHD.

LBP Review

LBP is computed by first selecting a neighborhood of size $\mathcal{N}$ (commonly $3\times 3$ or a radial window) and computing the difference between a center pixel, $x_{c}$, and each of the pixels in the defined neighborhood, $x_{i}$, in a grayscale image. The LBP feature value for a pixel is the sum, over the neighbors, of the thresholded difference between the neighbor and center pixel, $\mathcal{T}$, multiplied by two raised to the power of the index of the neighbor (Equation 1). The threshold for assigning a “0” or a “1” is usually 0 (i.e., the difference needs to be non-negative) as shown in Equation 2, but other variants may use different thresholds [zhou2009face, jiang2016adaptive]:

$$LBP(x_{c})=\sum_{i=0}^{\mathcal{N}-1}\mathcal{T}(x_{i}-x_{c})\,2^{i} \tag{1}$$

$$\mathcal{T}(x_{i}-x_{c})=\begin{cases}1, & (x_{i}-x_{c})\geq 0\\ 0, & \text{otherwise}\end{cases} \tag{2}$$
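As a worked illustration of Equations 1 and 2, the sketch below computes the LBP code for the center pixel of a $3\times 3$ patch. The clockwise neighbor ordering starting at the top-left pixel and the patch values are assumptions chosen for illustration; other orderings yield a different but equally valid code.

```python
# Clockwise neighbor offsets starting at the top-left pixel (an assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def threshold(diff):
    """Equation 2: 1 if the difference is non-negative, else 0."""
    return 1 if diff >= 0 else 0


def lbp_code(image, row, col):
    """Equation 1: sum of thresholded neighbor-center differences times 2**i."""
    center = image[row][col]
    code = 0
    for i, (dr, dc) in enumerate(OFFSETS):
        code += threshold(image[row + dr][col + dc] - center) * 2 ** i
    return code


patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch, 1, 1)  # bits for i = 0..7: 1, 0, 0, 0, 1, 1, 1, 1
```

For this patch the code is $1+16+32+64+128=241$.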

A toy example of the LBP feature for a local region of an image is shown in Figure 3. Once the LBP code value is computed for each pixel in an image, the code values are aggregated over the spatial dimensions, $M\times N$, of the image into a histogram, $\mathcal{H}\in\mathbb{R}^{G+1}$, where $G$ is the maximum LBP code value. This histogram, shown in Equations 3 and 4, is used as the descriptor (following notation from [guo2010completed]):

$$\mathcal{H}_{g}=\sum_{i=0}^{M}\sum_{j=0}^{N}f(LBP(x_{ij}),g),\quad g\in[0,G] \tag{3}$$

$$f(x,y)=\begin{cases}1, & x=y\\ 0, & \text{otherwise}\end{cases} \tag{4}$$
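The hard binning in Equations 3 and 4 amounts to counting how often each code value occurs. A minimal sketch follows, with a hypothetical function name `lbp_histogram` and made-up code values:

```python
def lbp_histogram(codes, max_code=255):
    """Equations 3 and 4: count how often each code value occurs in the image."""
    hist = [0] * (max_code + 1)
    for row in codes:
        for code in row:
            hist[code] += 1  # f(x, y) = 1 only for the bin that matches the code
    return hist


codes = [[0, 1, 1],
         [255, 1, 0]]  # toy 2x3 map of LBP codes
hist = lbp_histogram(codes)
```

Each pixel contributes to exactly one of the $G+1$ bins, so many bins can remain empty when few distinct codes occur.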

LBP is robust to monotonic changes in grayscale due to the threshold function [ojala1994performance, ojala1996comparative, humeau2019texture] (i.e., if the intensity values of an image change, the binary threshold will always return one of two values, “0” or “1”, despite gray level changes). There are a plethora of extensions to LBP in the literature, and several works summarize the novelty of each LBP-based approach [liu2017local, fernandez2013texture].

EHD Review

The EHD feature records the frequency and orientation of intensity changes in an image [frigui2008detection]. A single channel input image is convolved with a set of $K$ filters (e.g., Sobel [sobel19683x3]) of size $M\times N$ to calculate the edge responses, $R\in\mathbb{R}^{M^{\prime}\times N^{\prime}\times K}$. The edge responses are grouped (“binned”) together based on their orientation such as vertical, horizontal, diagonal, anti-diagonal, and no edges (isotropic) [frigui2008detection]. The five groups can include signed [peeples2018possibilistic, peeples2019comparison] or unsigned orientations [frigui2008detection] (e.g., $0^{\circ}$ and $180^{\circ}$ are both horizontal but different directions). To compute the histogram of edge orientations, $\mathbf{H}\in\mathbb{R}^{K+1}$, the maximum edge responses are summed over the spatial dimensions for the $k^{th}$ element (i.e., bin) as shown in Equation 5:

$$H_{k}=\sum_{i=0}^{M^{\prime}}\sum_{j=0}^{N^{\prime}}\mathcal{B}_{k}(R_{ijk}) \tag{5}$$

In order to only keep “strong” edge responses, a global threshold, $\theta_{G}$ , is selected and the voting function, $\mathcal{B}$ , transforms the edge responses to “votes” for each orientation as shown in Equations 6 and 7. The first case to consider is when an edge response is to be recorded. The edge response must be larger than any other edge orientation and greater than the threshold. If both conditions are met, then a “1” is assigned to the corresponding edge bin as shown in Equation 6:

$$\mathcal{B}(R_{ijk})=\begin{cases}0, & \underset{l:k\neq l}{\exists}\,R_{ijl}>R_{ijk}\text{ and }R_{ijk}\geq\theta_{G}\\ 0, & R_{ijk}<\theta_{G}\\ 1, & \text{otherwise}\end{cases} \tag{6}$$

The “no edge” case is when all edge responses are lower than the global threshold. When this condition is met, a value of “1” is assigned to the “no edge” or isotropic orientation as shown in Equation 7:

$$\mathcal{B}(R_{ij(K+1)})=\begin{cases}1, & \forall k=1,...,K;\ R_{ijk}<\theta_{G}\\ 0, & \text{otherwise}\end{cases} \tag{7}$$
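The voting rules in Equations 5 to 7 can be sketched as follows. The function name `ehd_histogram`, the nested-list layout of the responses, and the toy values are assumptions for illustration only.

```python
def ehd_histogram(responses, theta):
    """Hard-vote edge histogram with K edge bins plus one "no edge" bin.

    `responses[i][j]` holds the K edge responses at pixel (i, j).
    """
    num_edges = len(responses[0][0])
    hist = [0] * (num_edges + 1)
    for row in responses:
        for pixel in row:
            strongest = max(pixel)
            if strongest < theta:
                hist[num_edges] += 1  # Equation 7: every response is below theta
                continue
            for k, value in enumerate(pixel):
                # Equation 6: vote only if no other response exceeds this one.
                if value >= theta and value >= strongest:
                    hist[k] += 1
    return hist


# Toy 2x2 image with K = 2 edge responses per pixel.
responses = [[[0.95, 0.10], [0.20, 0.30]],
             [[0.50, 0.99], [0.91, 0.92]]]
hist = ehd_histogram(responses, theta=0.9)
```

Here one pixel votes for the first orientation, two for the second, and one for “no edge”. Note that, read literally, Equation 6 lets tied maximum responses each cast a vote, and the sketch keeps that behavior.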

An example of EHD features is shown in Figure 4. “Engineered” histogram-based features such as LBP and EHD have the same limitations as traditional feature engineering approaches. These limitations include the need for manual parameter tuning and domain expertise [nanni2017handcrafted]. To overcome these issues, others have proposed to 1) combine traditional and deep learning feature extraction approaches and 2) design neural networks to extract “engineered” features through network design.

II-B Combination of Traditional and Deep Learning Approaches

Traditional feature extraction approaches and deep learning methods are used 1) separately and 2) together. Generally, there are five approaches for combining traditional and deep learning approaches [peeples2022connecting]: 1) take features from a deep learning model and pass them into traditional classifiers [scabini2019evaluating, sani2017learning, zhang2016svm], 2) “engineer” the filters of a CNN (which may be updated through backpropagation or kept fixed) or emulate “engineered” features via the network design [bruna2013invariant, chan2015pcanet, malof2018improving, bianconi2019cnn, Su_2021_ICCV], 3) pass “engineered” features into the network as input [muhammad2017tex, anwer2018binary, van2019feeding], 4) combine both “engineered” and CNN features for traditional classifiers [nguyen2018combining, paul2016combining, wu2016multi], and 5) use texture encoding methods [cimpoi2015deep, zhang2017deep, song2017locally, xue2018deep, hu2019multi].

There are several problems with existing approaches for combining traditional and deep learning methods. First, an issue with incorporating “engineered” features is that the models cannot be trained in an end-to-end fashion because the “engineered” features are not updated as performance changes. Also, there are additional computational costs for training deep learning model(s) and separate classifier(s). Along with the increased computational costs, more parameter tuning is necessary for the “engineered” features, deep learning model(s), and classifier(s). Lastly, deep learning features perform well in practice, but these features are not as easily explainable and/or interpretable as traditional features, though there are ongoing efforts to “open” the black box [rudin2022interpretable].

III Method

III-A Neural “Engineered” Features Histogram Layer

The baseline histogram layer function in [peeples2021histogram] equally weights input features. As constructed, this implementation of the histogram layer will not be able to account for structural changes in texture information. For example, two binary images of a cross and checkerboard pattern will have the same distribution of pixels, but the arrangement of these pixels is different [peeples2022histogram]. The unweighted histogram layer will misidentify these two different structural texture types. To account for this, one can simply learn a weighting of the input features to account for structural differences. Our proposed neural “engineered” feature layer, $f$ , takes an input image or feature map(s), $\mathbf{X}$ , and applies two functions to extract texture information as shown in Equation 8:

$$f(\mathbf{X})=\phi\left(\sum_{\rho\in\mathcal{N}}\psi(\mathbf{x}_{\rho})\right) \tag{8}$$

where $\phi$ and $\psi$ represent the statistical and structural texture information respectively. Generally, $\psi$ is implemented as a local feature extractor in a given neighborhood $\mathcal{N}$ (e.g., $3\times 3$ edge kernel) and $\phi$ is selected as a global or local operation.
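The cross and checkerboard example mentioned above can be made concrete with a small sketch. The two $3\times 3$ binary patterns and the plus-shaped kernel are hypothetical choices for illustration: an unweighted count of pixel values cannot separate the patterns, while a weighted (structural) response can.

```python
from collections import Counter

cross = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]
checkerboard = [[1, 0, 1],
                [0, 1, 0],
                [1, 0, 1]]


def intensity_histogram(image):
    """Unweighted count of pixel values (statistical information only)."""
    return Counter(v for row in image for v in row)


def weighted_response(image, kernel):
    """Structural term: weighted sum of the pixels in the 3x3 neighborhood."""
    return sum(w * v
               for k_row, i_row in zip(kernel, image)
               for w, v in zip(k_row, i_row))


# Hypothetical kernel that responds to plus-shaped structure.
plus_kernel = [[0, 1, 0],
               [1, 1, 1],
               [0, 1, 0]]
same_histogram = intensity_histogram(cross) == intensity_histogram(checkerboard)
responses = (weighted_response(cross, plus_kernel),
             weighted_response(checkerboard, plus_kernel))
```

Both patterns contain five ones and four zeros, so their histograms match, yet the weighted responses differ (5 versus 1). Binning such weighted responses is what lets the layer account for structure.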

$\phi$ represents the histogram layer introduced in [peeples2021histogram]. The histogram layer output is shown in Equation 9. The normalized frequency count, $\phi_{rcbk}$, is computed with a sliding window of size $S\times T$ by applying the binning operation for the $b^{th}$ bin of the $k^{th}$ channel of the input $x$, with bin center $\mu_{bk}$ and bin width $\gamma_{bk}$:

$$\phi_{rcbk}=\cfrac{1}{ST}\sum_{s=1}^{S}\sum_{t=1}^{T}\exp\left(-\gamma_{bk}^{2}\left(\sum_{\rho\in\mathcal{N}}\psi_{bk}(\mathbf{x}_{\rho})-\mu_{bk}\right)^{2}\right) \tag{9}$$

where $r$ and $c$ are spatial dimensions of the histogram feature maps. The histogram layer is used to aggregate the structural texture information from the input. We detail the structural texture feature extraction process for NEHD and NLBP in the next two sections.
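The soft binning in Equation 9 can be sketched for a single window and a single channel. The sketch assumes the structural term (the output of $\psi$) has already been computed for each position in the window; the function name `soft_bin` and the parameter values are hypothetical.

```python
import math


def soft_bin(window, mu, gamma):
    """Average radial-basis vote of the values in a window for one bin.

    `mu` is the bin center and `gamma` scales the squared distance, so a
    larger `gamma` gives a narrower bin.
    """
    votes = [math.exp(-(gamma ** 2) * (v - mu) ** 2)
             for row in window for v in row]
    return sum(votes) / len(votes)


window = [[0.0, 0.0],
          [1.0, 1.0]]   # structural outputs inside a 2x2 window (S = T = 2)
centers = [0.0, 1.0]    # hypothetical bin centers
counts = [soft_bin(window, mu, gamma=3.0) for mu in centers]
```

Each bin receives a normalized count slightly above 0.5: half of the values sit exactly on the bin center and the other half contribute a small residual vote. Unlike the indicator function in Equation 4, this vote is differentiable with respect to the bin center and width.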

III-B Neural Edge Histogram Descriptor

NEHD captures structural texture information by accounting for edge information. $\psi^{NEHD}_{bk}$ is defined in Equation 10, where the input, $x$, is weighted by $w_{mnk}$, a value in an $M\times N$ kernel for each input channel:

$$\psi^{NEHD}_{bk}=\sum_{m=1}^{M}\sum_{n=1}^{N}w_{mnk}x_{r+s+m,c+t+n,k} \tag{10}$$

Equation 10 corresponds to the edge responses, $R$, used in Equation 5. Once the structural texture is captured, the output from $\psi^{NEHD}$ is passed into the histogram layer. Unlike the baseline EHD feature, the NEHD can be trained end-to-end to update both the structural and statistical texture representation.

The NEHD edge responses can be easily implemented using a convolutional layer to capture the edge feature maps as shown in Figure 5(a). The histogram layer is also implemented using two convolutional layers for the binning of the edge responses. For the “no-edge” orientation, two approaches can be used: thresholding and convolution. Similar to EHD, after the edge responses are computed from Equation 10, a threshold can be applied to detect if there is a “strong” edge present. The “no-edge” map could then be simply concatenated to the edge feature maps from the convolutional layer. The second approach is to learn the threshold operation by using a $1\times 1$ convolution to map the edge response maps to a single channel. A non-linearity (e.g., sigmoid) is then applied to the single feature map to learn a differentiable threshold operation. We investigate both approaches in Section V.
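The two “no-edge” options described above can be contrasted at a single pixel with a minimal sketch. The weights and bias below are hand-picked, hypothetical values rather than learned parameters, and the function names are illustrative; a $1\times 1$ convolution reduces to a weighted sum across channels at each pixel.

```python
import math


def no_edge_threshold(edge_responses, theta):
    """Hard rule: 1 when every edge response is below the threshold."""
    return 1.0 if max(edge_responses) < theta else 0.0


def no_edge_learned(edge_responses, weights, bias):
    """Differentiable alternative: 1x1 convolution (weighted sum), then sigmoid."""
    z = sum(w * r for w, r in zip(weights, edge_responses)) + bias
    return 1.0 / (1.0 + math.exp(-z))


strong_edge = [0.95, 0.10, 0.05, 0.20]   # one dominant edge response
flat_region = [0.05, 0.02, 0.04, 0.01]   # no strong response
weights, bias = [-10.0] * 4, 5.0         # hypothetical values, not learned
hard = [no_edge_threshold(p, theta=0.9) for p in (strong_edge, flat_region)]
soft = [no_edge_learned(p, weights, bias) for p in (strong_edge, flat_region)]
```

With negative weights on the edge channels, the sigmoid output is close to 0 where an edge is strong and close to 1 in flat regions, which mimics the hard threshold while remaining trainable.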

III-C Neural Local Binary Pattern

NLBP captures structural texture information by accounting for pixel differences. Similar to the LBCNN approach [liu2017local], $\psi^{NLBP}_{bk}$ is defined in Equation 11, where the input, $x$, is weighted by $w_{mnk}$, a value in an $M\times N$ kernel for each input channel:

$$\psi^{NLBP}_{bk}=\sum_{z=1}^{Z}\sigma\left(\sum_{m=1}^{M}\sum_{n=1}^{N}w_{mnk}x_{r+s+m,c+t+n,k}\right)\mathcal{V}_{kz} \tag{11}$$

where $\sigma$ is an activation function (e.g., sigmoid, ReLU) and $\mathcal{V}_{kz}$ is a learnable $1\times 1$ convolution. Equation 11 parallels Equations 1 and 2. The difference between a neighbor pixel and center pixel can be implemented using sparse convolution kernels where the weight on the center pixel is $-1$, the weight on a neighbor pixel is $1$, and all other values in the kernel are $0$ as shown in Figure 5(b). This operation can be extended to other neighbors by rotating the kernel to process all pixels in the defined neighborhood. After the difference is computed, instead of the threshold operation, an activation function such as sigmoid can be used to map the values between $0$ and $1$. The final step is to multiply the “thresholded” difference maps by the binary base. This can also be implemented using the $1\times 1$ convolution where the base value can be learned and not fixed to be a power of $2$.
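The construction above can be sketched for one $3\times 3$ patch. The clockwise neighbor ordering, the patch values, and the function names are assumptions for illustration. With a step activation and the base fixed to powers of 2 the sketch reduces to the LBP code of Equation 1; swapping in a sigmoid gives a smooth variant.

```python
import math

# Clockwise neighbor positions in a 3x3 kernel, starting top-left (assumed).
NEIGHBORS = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def difference_kernels():
    """Eight sparse kernels: -1 on the center, +1 on a single neighbor."""
    kernels = []
    for r, c in NEIGHBORS:
        kernel = [[0] * 3 for _ in range(3)]
        kernel[1][1] = -1
        kernel[r][c] = 1
        kernels.append(kernel)
    return kernels


def nlbp_code(patch, activation, base):
    """Activated pixel differences weighted by `base` (the 1x1 convolution)."""
    total = 0.0
    for kernel, weight in zip(difference_kernels(), base):
        diff = sum(w * v
                   for k_row, p_row in zip(kernel, patch)
                   for w, v in zip(k_row, p_row))
        total += activation(diff) * weight
    return total


def step(d):
    return 1.0 if d >= 0 else 0.0


def sigmoid(d):
    return 1.0 / (1.0 + math.exp(-d))


patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
powers_of_two = [2 ** i for i in range(8)]
hard_code = nlbp_code(patch, step, powers_of_two)     # equals the LBP code
soft_code = nlbp_code(patch, sigmoid, powers_of_two)  # smooth approximation
```

In a trainable layer, both the kernel weights and the entries of `base` would be parameters updated by backpropagation rather than the fixed values used here.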

IV Experimental Setup

Table I: Mean test accuracy ($\pm$ 1 standard deviation) across three experimental runs for NEHD initialization and parameter learning on FashionMNIST. The baseline EHD approach had an average test accuracy of $86.94\pm 0.00$. NEHD was initialized a) using EHD or b) randomly. The “no-edge” map was either a) learned using a convolution layer or b) obtained by thresholding the edge response maps. The NEHD layer consisted of structural (convolution kernels) and statistical (histogram layer) texture features. The impact of updating both texture representations is captured across the columns of the table. The model with the best average test accuracy is bolded.

Random Initialization	No Edge Threshold	Learn Both	Learn Structural	Learn Statistical	Fix Both
		89.74 $\pm$ 0.12	89.09 $\pm$ 0.24	88.81 $\pm$ 0.05	83.82 $\pm$ 0.26
	✓	89.51 $\pm$ 0.28	89.28 $\pm$ 0.24	88.35 $\pm$ 0.02	83.36 $\pm$ 0.06
✓		89.13 $\pm$ 0.23	88.87 $\pm$ 0.27	88.50 $\pm$ 0.17	78.08 $\pm$ 3.14
✓	✓	89.16 $\pm$ 0.23	88.52 $\pm$ 0.22	88.28 $\pm$ 0.30	76.91 $\pm$ 2.54

The experiments are divided into two main parts: ablation study and feature comparison. All experiments used a shallow model that consisted of the input image, feature extractor (EHD, LBP, NEHD, or NLBP), and a fully connected layer. We chose this simple model to focus on the discriminative power of each feature using a linear classifier. In the ablation study, the FashionMNIST [xiao2017fashion] dataset is used to investigate the impact of a) initialization and b) parameter learning in Section V-A. For the initialization experiments, the impact of initializing the layer randomly or with the baseline feature setting (i.e., NEHD kernels initialized with Sobel kernels and threshold operation) was evaluated. The parameter learning experiments focused on the contribution of learning a) statistical texture features, b) structural texture features, and c) both. Additionally, texture features such as LBP and EHD have typically been applied to grayscale or single channel inputs [porebski2008haralick]. The Plant Root Minirhizotron Imagery (PRMI) dataset [xu2022prmi] consists of RGB plant images. For both NEHD and NLBP, different multichannel approaches were investigated in Section V-B: treating each channel independently, using a $1\times 1$ convolution to map an RGB image to a single channel input, and converting an RGB image to grayscale.

An extensive comparison is performed across three datasets (FashionMNIST [xiao2017fashion], PRMI [xu2022prmi], BloodMNIST [medmnistv1, medmnistv2, bloodmnist]) with the baseline EHD and LBP, LBP variants, NEHD, and NLBP as discussed in Section V-C. EHD was implemented using PyTorch. In order to compare our proposed method with the original LBP and LBP variants, the delayed function from the Dask package [rocklin2015dask] was used to parallelize the Sklearn LBP feature extraction for each mini-batch of images. Each mini-batch is passed through a loop that holds each image to be computed in memory. Then, the Dask function produces a new list where the objects are all computed at the same time, and this list is converted to a tensor of the appropriate shape. This method allowed for faster computation of the LBP of a mini-batch of images and integration of the features into the PyTorch framework.

BloodMNIST and PRMI have dedicated training, validation, and test partitions. For FashionMNIST, the training data was divided into a 90/10 train and validation split. The trained model was then applied to the holdout test set. A subset of the PRMI dataset (following [peeples2022divergence]) was chosen using the following four classes: cotton, sunflower, papaya, and switchgrass. No data augmentation was used for FashionMNIST. For BloodMNIST and PRMI, the data augmentation procedure followed [peeples2021histogram, xue2018deep]. The images were resized to $128\times 128$ with random and center crops of $112\times 112$ as input into the models.

The experimental parameters were the following: 100 epochs, batch size of 64, Adam optimization, and initial learning rate of $0.001$ for EHD, NEHD, and NLBP. The initial learning rate for the LBP baseline and variants was $0.01$. The EHD and NEHD parameters were set to the following: $3\times 3$ edge kernel, $5\times 5$ window size to aggregate bin counts, threshold of $0.9$, normalization of count (i.e., average pooling), and normalized kernel values. Both EHD and NEHD were set to extract eight edge orientations and “no edge”, resulting in a $9$-bin histogram.

The baseline LBP and variants used a radius of 1 and neighborhood size ($P$) set to $8$. Each LBP approach used the default number of bins: baseline LBP (256), uniform LBP (59), nri-uniform LBP (59), ror LBP (256), and var LBP (256). NLBP used the following settings: $3\times 3$ pixel difference kernel, $5\times 5$ window size to aggregate bin counts, ReLU activation function, normalization of count (i.e., average pooling), and 16-bin histogram. The ReLU activation function was used in NLBP to promote better learning [liu2017local] and the input range of the data was between $0$ and $1$ (the range of the pixel difference is between $-1$ and $1$, resulting in the ReLU function mapping the data between $0$ and $1$). A total of five experimental runs were completed for each model on the PRMI dataset while three experimental runs were used for BloodMNIST and FashionMNIST for each configuration. Experiments were completed on an A100 GPU.

V Results and Discussion

V-A Ablation Study: Initialization and Parameter Learning

Table II: Mean test accuracy ($\pm$ 1 standard deviation) across three experimental runs for NLBP initialization and parameter learning on FashionMNIST. The baseline LBP approach had an average test accuracy of $71.66\pm 0.02$. NLBP was initialized a) using LBP or b) randomly. The NLBP base power was either a) fixed (power of $2$) or b) learned. The NLBP layer consisted of structural (convolution kernels) and statistical (histogram layer) texture features. The impact of updating both texture representations is captured across the columns of the table. The model with the best average test accuracy is bolded.

Random Initialization	Fixed Base	Learn Both	Learn Structural	Learn Statistical	Fix Both
		85.54 $\pm$ 0.07	85.70 $\pm$ 0.01	86.63 $\pm$ 0.02	79.28 $\pm$ 0.00
	✓	85.52 $\pm$ 0.04	85.56 $\pm$ 0.01	86.68 $\pm$ 0.02	78.65 $\pm$ 0.00
✓		87.44 $\pm$ 0.40	86.67 $\pm$ 0.50	85.20 $\pm$ 0.03	81.84 $\pm$ 0.51
✓	✓	87.50 $\pm$ 0.30	81.53 $\pm$ 0.42	80.74 $\pm$ 0.48	36.05 $\pm$ 1.60

V-A1 NEHD

The results of the NEHD initialization and parameter learning are shown in Table I. The proposed NEHD layer was robust to the random or EHD initialization as shown by overlapping error bars for each test accuracy except for when the structural and statistical textures were fixed. When the layer is fixed, the EHD initialization performed significantly better than random initialization. Another interesting observation is that for most learnable texture features, the EHD initialization led to marginally better performance. Edges are important features that distinguish the classes in FashionMNIST as there is no background in the images and only the articles of clothing are present. Therefore, the edge “profiles” for each clothing item can be a powerful discriminator between the different classes. This analysis is further validated as the baseline EHD feature achieved a higher average accuracy ($86.94$) than all of the fixed statistical and structural texture feature models with random or EHD initialization. The EHD initialization for NEHD is not exactly equal to EHD due to the soft binning approximation of the histogram layer.

The most difficult classes to distinguish were the “T-shirt/top” and “shirt.” The baseline EHD had difficulty differentiating between these two classes as the representation is fixed and cannot adapt. However, our proposed NEHD layer outperformed the baseline EHD feature for learnable settings: learn structural, statistical, or both texture features. For the structural texture features, the edge kernel can be adapted from the initial Sobel kernels to account for more details regarding the structure of each clothing item. The statistical features assist with the soft binning approximation to mitigate small intra-class variations for each clothing item. As noted in Table I, the combination of learning both texture feature representations achieved the best performance on the FashionMNIST test set.

V-A2 NLBP

The results of the ablation study on NLBP are shown in Table II. The NLBP feature was robust to either random or LBP initialization except for the case when the kernel was randomly initialized and the base was fixed to be a power of $2$. This result matches similar analysis from [liu2017local], which shows that fixing the base power may limit the generalization ability of the feature. Another interesting observation is that the randomly initialized models performed slightly better for every setting except when learning statistical features and fixing both texture feature parameters with a fixed base. For the statistical features, the initialization of the histogram layer to meaningful bin centers and widths will assist in the learning process of the model. For the structural textures, randomly initializing the kernels led to an improvement in performance except for when the base power was fixed. This result indicates that the initialization of the kernel used to capture the relationship between neighboring pixels is important when capturing the structural texture information. The LBP initialization sets the NLBP feature to sparse kernels, and this may limit the learning ability of the model.

Similar to NEHD, maximal performance is achieved when both structural and statistical texture features are jointly learned. When LBP initialization is used, updating the statistical features achieved higher performance than only updating the structural features. This result is intuitive as a limitation of the original LBP is that the histogram can be sparse if some of the 256 LBP codes are not well represented. With the NLBP approach, the aggregation of the LBP encoding or bit map can be adjusted by updating the bin centers and widths to maximize the performance on different tasks such as image classification. For NLBP, all learning settings (i.e., updating structural and/or statistical textures) statistically significantly outperformed both the baseline LBP feature ($71.66$ average test accuracy) and the models with all parameters fixed, demonstrating the utility of the proposed neural “engineered” feature.

V-B Ablation Study: Multichannel processing

Table III: Multichannel processing approaches for NEHD and NLBP on the PRMI dataset. The mean test accuracy with $\pm$ 1 standard deviation across five experimental runs is shown. The independent setting applied the neural “engineered” feature to each channel separately, the $1\times 1$ convolution was a learnable mapping of the three channel input to a single channel, and grayscale converted the image to a single channel intensity image. The NLBP and NEHD fusion approach with the best average test accuracy is bolded for each feature.

Model	Independent	$1\times 1$ Convolution	Grayscale
NEHD	$\mathbf{89.92\pm 0.21}$	$88.18\pm 5.60$	$89.45\pm 1.56$
NLBP	$\mathbf{91.17\pm 1.06}$	$91.08\pm 3.00$	$88.76\pm 6.44$

EHD and LBP are generally applied to grayscale or single-channel inputs (though efforts have been made to extend these features to multichannel inputs [humeau2022color]). For NEHD and NLBP, we explored three approaches for applying these neural “engineered” features to RGB images such as those in the PRMI dataset [xu2022prmi]. One approach was to treat each channel independently and extract the neural “engineered” features, similar to [peeples2021histogram]. For example, NEHD can be applied separately to the red, green, and blue channels. Each input channel would have $D$ edge orientation histogram feature maps (no edge included). The resulting feature maps would be concatenated before the next layer in an artificial neural network. The two other approaches, a) $1\times 1$ convolution and b) grayscale conversion, take the RGB image and convert it to a single channel. The grayscale conversion only works for an RGB input image, whereas the $1\times 1$ convolution approach generalizes to input feature maps of varying channel dimensions.
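The three multichannel strategies can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: `toy_feature` is a hypothetical stand-in for NEHD or NLBP that simply returns `num_maps` feature maps per input channel, the $1\times 1$ convolution weights are supplied by hand rather than learned, and the grayscale conversion assumes the ITU-R BT.601 luminance weights, which may differ from the conversion used in the experiments.

```python
def toy_feature(channel, num_maps):
    """Hypothetical stand-in for NEHD/NLBP: one channel in, num_maps maps out."""
    return [[[value * (k + 1) for value in row] for row in channel]
            for k in range(num_maps)]


def independent(image, num_maps):
    """Apply the feature to each channel and concatenate all feature maps."""
    maps = []
    for channel in image:
        maps.extend(toy_feature(channel, num_maps))
    return maps


def pointwise_conv(image, weights):
    """1x1 convolution: weighted sum across channels at every pixel."""
    if len(weights) != len(image):
        raise ValueError("need one weight per input channel")
    rows, cols = len(image[0]), len(image[0][0])
    return [[sum(w * ch[r][c] for w, ch in zip(weights, image))
             for c in range(cols)]
            for r in range(rows)]


def grayscale(image):
    """Fixed RGB-to-intensity mapping (BT.601 weights, assumed here)."""
    if len(image) != 3:
        raise ValueError("grayscale conversion expects exactly 3 channels")
    return pointwise_conv(image, [0.299, 0.587, 0.114])


# Toy 2x2 RGB image stored as [channel][row][col].
rgb = [[[1.0, 2.0], [3.0, 4.0]],
       [[0.0, 0.0], [0.0, 0.0]],
       [[2.0, 2.0], [2.0, 2.0]]]
D = 4  # feature maps per channel, matching the symbol D in the text

maps_independent = independent(rgb, D)                              # 3 * D maps
maps_pointwise = toy_feature(pointwise_conv(rgb, [1.0, 0.0, 0.5]), D)  # D maps
maps_gray = toy_feature(grayscale(rgb), D)                          # D maps
```

The sketch shows why the independent approach produces a number of feature maps that grows with the channel count, while both single-channel mappings keep the output at $D$ maps; it also shows that grayscale conversion is a special case of a $1\times 1$ convolution with fixed weights.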

Table IV: Comparison results of our proposed neural “engineered” features and “engineered” features across different datasets. The mean test accuracy with $\pm 1$ standard deviation across multiple experimental runs is shown. The model with the highest test accuracy for each dataset is bolded.

Method	FashionMNIST	PRMI	BloodMNIST
EHD	$86.94\pm 0.00$	$54.58\pm 6.62$	$66.33\pm 0.07$
LBP	$71.66\pm 0.02$	$69.68\pm 0.12$	$43.25\pm 0.07$
LBP ROR	$51.10\pm 0.12$	$70.70\pm 0.11$	$40.32\pm 0.07$
LBP Uniform	$28.26\pm 0.01$	$67.17\pm 0.00$	$33.41\pm 0.06$
LBP NRI Uniform	$51.07\pm 0.02$	$71.30\pm 0.03$	$33.92\pm 0.16$
LBP Var	$10.00\pm 0.00$	$59.93\pm 0.11$	$19.47\pm 0.00$
NEHD (ours)	$\mathbf{89.74\pm 0.12}$	$89.92\pm 0.21$	$\mathbf{83.60\pm 0.35}$
NLBP (ours)	$87.50\pm 0.30$	$\mathbf{91.17\pm 1.06}$	$76.06\pm 1.47$

The results of our multichannel processing for the best NLBP and NEHD from Section V-A are shown in Table III. From the results, independently processing each channel achieved the highest average test accuracy. Independently processing each channel retains the most information, so this result is intuitive. However, the number of features scales with the number of input feature channels. For grayscale conversion and $1\times 1$ convolution, the average test accuracy is comparable to that of the independent processing approach, though both have increased variability, as indicated by their larger standard deviations. When integrating the neural “engineered” features into a deeper network, the $1\times 1$ convolution approach may be the most suitable, as it reduces the number of features extracted when compared to the independent fusion approach. The grayscale conversion is limited to RGB inputs.

V-C Comparison of Neural and “Engineered” Features

The last set of experiments compared our proposed NEHD and NLBP with the baseline EHD and LBP variants across three different datasets: FashionMNIST [xiao2017fashion], PRMI [xu2022prmi], and BloodMNIST [medmnistv1, medmnistv2, bloodmnist]. The results of the comparisons are shown in Table IV. Overall, both NEHD and NLBP outperform the “engineered” features across all three datasets. For FashionMNIST, as discussed in Section V-A, edges are important for distinguishing the different articles of clothing, resulting in EHD achieving the best performance among the “engineered” features. However, our NEHD and NLBP achieve statistically significantly (i.e., no overlapping error bars) improved performance. PRMI and BloodMNIST consist of RGB images. Each feature was applied to each channel independently for a fair comparison between approaches. The PRMI results show that NLBP slightly outperformed NEHD. For the PRMI dataset, edge information is important to identify the roots, but some images have illumination differences. LBP is robust to illumination variations (e.g., monotonic changes) [humeau2019texture], resulting in NLBP retaining a similar property when extracting texture features from the images.
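The "no overlapping error bars" criterion can be checked directly against the values in Table IV. The short Python sketch below assumes the error bars are the intervals of mean $\pm 1$ standard deviation; the helper name `intervals_overlap` is hypothetical.

```python
def intervals_overlap(mean_a, std_a, mean_b, std_b):
    """Return True if [mean - std, mean + std] of a and b intersect."""
    low_a, high_a = mean_a - std_a, mean_a + std_a
    low_b, high_b = mean_b - std_b, mean_b + std_b
    return low_a <= high_b and low_b <= high_a


# FashionMNIST (Table IV): both neural features vs. the best baseline, EHD.
nehd_vs_ehd = intervals_overlap(89.74, 0.12, 86.94, 0.00)
nlbp_vs_ehd = intervals_overlap(87.50, 0.30, 86.94, 0.00)

# PRMI (Table IV): NEHD vs. NLBP.
nehd_vs_nlbp_prmi = intervals_overlap(89.92, 0.21, 91.17, 1.06)

print(nehd_vs_ehd, nlbp_vs_ehd, nehd_vs_nlbp_prmi)
```

On FashionMNIST neither neural feature's interval intersects that of EHD, whereas on PRMI the NEHD and NLBP intervals do intersect, which is consistent with describing NLBP as only slightly better than NEHD on that dataset.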

NEHD and NLBP are compared with the best EHD and LBP variants qualitatively (Figure 6) and quantitatively (Figure 7). As seen in the t-SNE visuals in Figure 6, the features learned by NEHD and NLBP appear more compact in the 2D projection than the “engineered” features. An important aspect of both approaches is the aggregation performed by the histogram layer. The soft binning approach provides a way to account for intra-class variations that can be present within the dataset [peeples2021histogram]. As shown by the confusion matrices in Figure 7, the neural “engineered” features outperform the “engineered” features. For some of the PRMI images, the roots are occluded by background information (e.g., soil, image artifacts), and the fixed representation of the “engineered” features is limited in mitigating the impact of these confusers in the images. However, our neural “engineered” features can change the texture representation to improve classification performance.

BloodMNIST was the most difficult dataset for the shallow network used, since it has eight classes (more than PRMI) and the blood cells show only small visual differences between classes. However, our proposed features achieved statistically significantly higher test accuracy than the “engineered” features. Texture information is vital in several biomedical applications [liu2019bow]. Our proposed approach can be applied to several biomedical tasks and integrated into deeper networks to achieve improved performance on the BloodMNIST dataset.

VI Conclusion

In this work, we proposed neural handcrafted features by using histogram layers to aggregate structural texture features. We demonstrate the approach by introducing two neural handcrafted features: NEHD and NLBP. Our results across benchmark (FashionMNIST) and real-world (PRMI and BloodMNIST) datasets showcase the potential use of these features. The general framework for neural handcrafted features can be used for other texture feature approaches such as Haralick texture features [haralick1973textural], histogram of oriented gradients [dalal2005histograms] and several more statistical and structural texture feature extraction methods [liu2019bow, humeau2019texture]. Future work includes integrating the neural handcrafted layer(s) into deeper networks, improving the multichannel processing approach, and designing new objective functions to maximize statistical and/or structural texture information.

Acknowledgment

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1842473 and by the Office of Naval Research grant N00014-16-1-2323. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.
