Histogram Layers for Neural “Engineered” Features

Joshua Peeples, Salim Al Kharsa, Luke Saleh, and Alina Zare

J. Peeples is an Assistant Professor and S. Al Kharsa is an undergraduate student in the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA, e-mails: [email protected] and [email protected]. L. Saleh is an undergraduate student and A. Zare is a Professor in the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, 32608, USA, e-mails: [email protected] and [email protected].
Abstract

In the computer vision literature, many effective histogram-based features have been developed. These “engineered” features include local binary patterns and edge histogram descriptors, among others, and they have been shown to be informative features for a variety of computer vision tasks. In this paper, we explore whether these features can be learned through histogram layers embedded in a neural network and, therefore, be leveraged within deep learning frameworks. By using histogram features, local statistics of the feature maps from convolutional neural networks can be used to better represent the data. We present neural versions of local binary pattern and edge histogram descriptors that jointly improve the feature representation and perform image classification. Experiments are presented on benchmark and real-world datasets. Our code is publicly available at https://github.com/Advanced-Vision-and-Learning-Lab/NEHD_NLBP.

Index Terms:
Deep Learning, Histograms, Feature Learning, Feature Engineering.

I Introduction

Before the popularity of deep learning, feature engineering played (and still plays) a vital role in the fields of computer vision and machine learning. Examples of engineered features include local binary patterns [ojala1994performance] (LBP) and edge histogram descriptors (EHD) [manjunath2002introduction]. Overall, extracting and selecting the best features (i.e., feature engineering) was difficult. Many of these engineered features were created through the use of histograms, but there were many parameters (such as bin centers and widths) that were difficult to tune. As an alternative to the difficult and time-consuming process of feature engineering, deep learning is used to automate the process of extracting features and performing follow-on tasks (e.g., classification, segmentation, object detection).

Convolutional neural networks (CNNs) are frequently used in a variety of applications. These models extract features through a series of convolutions, aggregation functions (e.g., max and average pooling) and non-linear activation functions to improve the representation of the data for better performance. Despite the powerful, expressive features learned by a CNN, these models have some shortcomings. CNNs have a spatial stationarity assumption; therefore, CNNs are unable to account for changes in the local statistics (i.e., statistical texture features [peeples2022histogram]) for regions of the data [taigman2014deepface, liang2021patch]. Additionally, CNNs can incur high costs in terms of training time, memory requirements, and the need for large datasets [juefei2017local]. CNNs effectively capture structural texture features while histogram layers focus on statistical texture features [peeples2022histogram].

To mitigate these issues associated with traditional and deep learning features, alternative models have been introduced that take inspiration from both approaches. A notable method is the local binary convolutional neural network (LBCNN) [juefei2017local]. The LBCNN used a novel architecture design that led to a less expensive model that performed comparably to standard CNNs. LBCNN was inspired by LBP; however, the LBCNN generalizes the code generation and does not account for the aggregation operation of the original LBP.

Figure 1: Feature maps captured by LBP (top) and NLBP (bottom) from the FashionMNIST dataset. The proposed approach nearly reconstructs the original LBP feature. The LBP and NLBP encodings differ slightly due to the non-linearity: threshold (LBP) and ReLU (NLBP). Additionally, a square window ($3 \times 3$) is used for NLBP while LBP used a radial window for the local neighborhood. There are small magnitude differences in the histogram because of the soft binning approximation used in NLBP.
Figure 2: Feature maps captured by EHD (top row) and NEHD (bottom row) from the FashionMNIST dataset. The proposed approach closely reconstructs the original EHD feature. There are small magnitude differences because of the soft binning approximation used in NEHD.

Histogram-based feature approaches (e.g., LBP and EHD) as well as CNNs are able to extract texture information [maenpaa2005texture, liu2019bow, hermann2020origins, peeples2021histogram] to represent the data. There are two types of textures: statistical and structural [peeples2021histogram, ji2022structural]. Both texture feature types are useful to represent the data to improve performance and features such as LBP use both within the feature extraction pipeline [maenpaa2005texture]. In this work, we present a generalized framework to learn histogram-based features in artificial neural networks: neural LBP (NLBP) and EHD (NEHD). The contributions of this new method are four-fold:

  • Flexibility: mitigation of issues associated with parameter selection

  • Expressibility: ability to change feature representation to best match problem (original LBP/EHD may not be best for problem)

  • Synergy: fusion of statistical and structural textures in network design for neural “engineered” features

  • Utility: the proposed layers can be used to learn histogram-based features and in future work, other powerful histogram-based features could be discovered through training this layer.

Each neural feature can be reconstructed using convolutional and histogram layers to learn the structural and statistical texture features respectively. Our proposed neural “engineered” features can closely reconstruct the LBP and EHD features as shown in Figures 1 and 2. The LBP feature consists of both structural (i.e., encoded pixel differences) and statistical (i.e., binary code frequency) texture information. To capture the structural relationships, following [liu2017local], sparse convolutional kernels can be used to capture the difference between the center and neighboring pixels in an image. After the convolution operation is performed, an activation function (e.g., sigmoid, ReLU) can be used instead of the threshold operation. After the input image has been encoded, a histogram layer can be used to adaptively aggregate the encoded features as opposed to the fixed binning of the original LBP feature.

Figure 1 shows a comparison of LBP and the proposed NLBP. To reconstruct the LBP feature, a modified ReLU activation function can be used. The standard ReLU function sets all negative values to $0$ (similar to the threshold function in LBP), but if the center pixel and neighbor pixel value are the same, the LBP threshold function will output a value of $1$. To account for this in the NLBP, if the difference between a center and neighboring pixel is $0$, the ReLU function was set to output a $1$ instead of a $0$, resulting in a similar encoding map as the original LBP. The histogram layer bins are initialized to the corresponding LBP code values ranging from $0$ to $255$. The bin width was set to $3.75$ as this value will correspond to a narrow histogram bin.
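
The modified activation described above can be sketched as follows. This is a minimal illustration under our reading of the text, not the released implementation; the helper names `modified_relu` and `lbp_threshold` are hypothetical.

```python
def modified_relu(diff):
    """ReLU variant sketched from the text: a zero difference maps to 1.

    Hypothetical helper (not from the paper's code). Negative differences
    are set to 0 as in a standard ReLU, positive differences pass through,
    and a difference of exactly 0 returns 1 to mirror the LBP threshold,
    which outputs 1 when the neighbor equals the center pixel.
    """
    if diff == 0:
        return 1.0
    return max(0.0, float(diff))


def lbp_threshold(diff):
    """Original LBP threshold: 1 if the difference is >= 0, else 0."""
    return 1.0 if diff >= 0 else 0.0


# Differences between neighbor and center pixel values.
diffs = [-3, 0, 2]
relu_out = [modified_relu(d) for d in diffs]
lbp_out = [lbp_threshold(d) for d in diffs]
```

The two encodings agree for negative and zero differences and differ only in magnitude for positive differences, which is consistent with the slight encoding differences noted in Figure 1.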

The EHD feature also captures structural (i.e., edges) and statistical (i.e., max response frequency) texture information. The structural features can be captured using a convolutional layer with each filter corresponding to a Sobel kernel for each edge orientation, $0^{\circ}$ to $315^{\circ}$. A total of eight edge kernels were used to capture horizontal, vertical, diagonal and anti-diagonal orientations (as in the standard EHD feature). To account for direction information, the bin center values are set to the maximum response of the convolution output from the edge kernels and the input data (i.e., maximum input value $\times\,(1+2+1)$). By changing the bin center, the binning function will now account for direction information. The anti-direction will now have the lowest response to the corresponding correct direction (i.e., $0^{\circ}$ will have lowest response for $180^{\circ}$). The same bin width value ($3.75$) was used for NEHD as in NLBP. The parameters (convolution kernels, bin centers, and bin widths) were fixed to ensure that the histogram layer can mimic the EHD feature as shown in Figure 2.
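
One way to obtain the eight oriented kernels and the corresponding bin center is sketched below. The construction (shifting the outer ring of a $3 \times 3$ Sobel kernel in $45^{\circ}$ steps), the assignment of angles to rotation steps, the input range of $[0, 1]$, and the names `RING`, `rotate_45`, and `max_response` are assumptions made for illustration; the paper does not specify them.

```python
# Clockwise ring of outer positions in a 3x3 kernel, starting top-left.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

# Base Sobel kernel (assumed here to represent the 0 degree orientation).
SOBEL_0 = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def rotate_45(kernel, steps):
    """Shift the outer ring of a 3x3 kernel clockwise by `steps` positions."""
    values = [kernel[r][c] for r, c in RING]
    rotated = [[0, 0, 0], [0, kernel[1][1], 0], [0, 0, 0]]
    for idx, (r, c) in enumerate(RING):
        rotated[r][c] = values[(idx - steps) % len(RING)]
    return rotated


# Eight kernels, one per 45 degree step.
kernels = [rotate_45(SOBEL_0, k) for k in range(8)]


def max_response(kernel, max_input=1.0):
    """Largest possible response: max input on every positive weight, 0 elsewhere."""
    return max_input * sum(w for row in kernel for w in row if w > 0)


# Each bin center equals max_input * (1 + 2 + 1).
bin_centers = [max_response(k) for k in kernels]
```

In this sketch the kernel four steps away is the negation of the base kernel, so an input that maximizes the base response produces the minimum response for the opposite direction.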

The spatial locations of the responses align properly as shown in Figure 2. The “no edge” responses of the NEHD and EHD feature maps are nearly the same. Since NEHD is a soft approximation of EHD, there will be some overlap between the edge orientations. As a result, there are small magnitude differences for each edge orientation. The bin width can be tuned to produce a more exact representation for each edge response. The EHD feature can be reconstructed with the histogram layer if 1) the weights on the input features represent edge kernels and 2) the bin centers are equal to the maximum response of each orientation.

II Related Work

Figure 3: Example of LBP feature value calculation. After a local neighborhood is selected (e.g., $3 \times 3$), the difference between the center and neighboring pixels is computed. If the difference is greater than or equal to a chosen threshold (e.g., $0$), then a value of “1” is returned for that neighbor. The next step is to encode the binary position of each neighbor (i.e., $2^{0}$, $2^{1}$, …, $2^{7}$). An elementwise multiplication and sum operation is performed between the threshold and binary position values. The output of this operation is the LBP feature for the local patch. The values from each local patch of the input image are then aggregated via a histogram to generate the output LBP histogram vector.

II-A “Engineered” Histogram Features

Histograms are used throughout computer vision and machine learning as a method to aggregate intensity and/or feature values as well as relationships between neighboring inputs (e.g., edge orientation, pixel differences). Common histogram-based features include LBP [ojala1994performance], histogram of oriented gradients (HOG) [dalal2005histograms], EHD [frigui2008detection], and gray level co-occurrence matrix (GLCM) [haralick1973textural]. For this work, we will focus on LBP and EHD.

LBP Review

LBP is computed by first selecting a neighborhood of size $\mathcal{N}$ (commonly $3 \times 3$ or a radial window) and computing the difference between a center pixel, $x_c$, and each of the pixels in the defined neighborhood, $x_i$, in a grayscale image. The LBP feature value for a pixel in an image is computed by summing, over the neighbors, the product of the threshold, $\mathcal{T}$, of the difference between the neighbor pixel and the center pixel, and two raised to the power of the index of the neighbor (Equation 1). The threshold for assigning a “0” or a “1” is usually 0 (i.e., the difference needs to be non-negative) as shown in Equation 2, but other variants may use different thresholds [zhou2009face, jiang2016adaptive]:

$$LBP(x_c) = \sum_{i=0}^{\mathcal{N}-1} \mathcal{T}(x_i - x_c)\, 2^{i} \tag{1}$$

$$\mathcal{T}(x_i - x_c) = \begin{cases} 1, & (x_i - x_c) \geq 0 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
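
As a minimal sketch of Equations 1 and 2 for a single $3 \times 3$ patch: the clockwise neighbor ordering starting at the top-left pixel and the function name `lbp_code` are assumptions made for this illustration, and other orderings yield a permuted code.

```python
def lbp_code(patch):
    """Compute the LBP code of the center pixel of a 3x3 patch.

    The clockwise neighbor ordering starting at the top-left is an
    assumption made for this sketch.
    """
    center = patch[1][1]
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for i, (r, c) in enumerate(order):
        bit = 1 if patch[r][c] - center >= 0 else 0  # threshold in Equation 2
        code += bit * 2 ** i                         # weight by 2^i in Equation 1
    return code


patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch)  # bits set for i = 0, 4, 5, 6, 7
```

For this patch, the neighbors at indices 0, 4, 5, 6, and 7 are greater than or equal to the center, giving $1 + 16 + 32 + 64 + 128 = 241$.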

A toy example of the LBP feature for a local region of an image is shown in Figure 3. Once the LBP code value is computed for each pixel in an image, the code values are aggregated over the spatial dimensions, $M \times N$, of the image into a histogram, $\mathcal{H} \in \mathbb{R}^{G+1}$, where $G$ is the maximum LBP code value. This histogram is used as the descriptor, as shown in Equation 3 (following notation from [guo2010completed]):

$$\mathcal{H}_g = \sum_{i=0}^{M} \sum_{j=0}^{N} f(LBP(x_{ij}), g), \quad g \in [0, G] \tag{3}$$

$$f(x, y) = \begin{cases} 1, & x = y \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
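
The hard binning in Equations 3 and 4 amounts to counting how often each code value occurs. A small sketch follows; the function name `lbp_histogram` and the toy code map are hypothetical.

```python
def lbp_histogram(codes, max_code=255):
    """Count how often each code value g in [0, max_code] occurs."""
    hist = [0] * (max_code + 1)
    for row in codes:
        for value in row:
            hist[value] += 1  # f(x, y) = 1 only when the code equals bin g
    return hist


# Hypothetical 2x3 map of LBP codes, one per pixel.
code_map = [[241, 255, 0],
            [241, 0, 241]]
hist = lbp_histogram(code_map)
```

Each pixel contributes to exactly one bin, so the bin counts sum to the number of pixels. The soft binning used by the histogram layer relaxes this one-hot assignment.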

LBP is robust to monotonic changes in grayscale due to the threshold function [ojala1994performance, ojala1996comparative, humeau2019texture] (i.e., if the intensity values of an image change, the binary threshold will always return one of two values, “0” or “1”, despite gray level changes). There are a plethora of extensions to LBP in the literature, and several works summarize the novelty of each LBP-based approach [liu2017local, fernandez2013texture].

EHD Review

The EHD feature records the frequency and orientation of intensity changes in an image [frigui2008detection]. A single channel input image is convolved with a set of $K$ filters (e.g., Sobel [sobel19683x3]) of size $M \times N$ to calculate the edge responses, $R \in \mathbb{R}^{M' \times N' \times K}$. The edge responses are grouped (“binned”) together based on their orientation such as vertical, horizontal, diagonal, anti-diagonal, and no edges (isotropic) [frigui2008detection]. The five groups can include signed [peeples2018possibilistic, peeples2019comparison] or unsigned orientations [frigui2008detection] (e.g., $0^{\circ}$ and $180^{\circ}$ are both horizontal but different directions). To compute the histogram of edge orientations, $\mathbf{H} \in \mathbb{R}^{K+1}$, the maximum edge responses are summed over the spatial dimensions for the $k^{th}$ element (i.e., bin) as shown in Equation 5:

$$H_k = \sum_{i=0}^{M'} \sum_{j=0}^{N'} \mathcal{B}_k(R_{ijk}) \tag{5}$$

In order to only keep “strong” edge responses, a global threshold, $\theta_G$, is selected and the voting function, $\mathcal{B}$, transforms the edge responses to “votes” for each orientation as shown in Equations 6 and 7. The first case to consider is when an edge response is to be recorded. The edge response must be at least as large as the response of every other orientation and no smaller than the threshold. If both conditions are met, then a “1” is assigned to the corresponding edge bin as shown in Equation 6:

$$\mathcal{B}(R_{ijk}) = \begin{cases} 0, & \underset{l:k\neq l}{\exists} R_{ijl} > R_{ijk} \text{ and } R_{ijk} \geq \theta_G \\ 0, & R_{ijk} < \theta_G \\ 1, & \text{otherwise} \end{cases} \tag{6}$$
Figure 4: Overall process for EHD feature. The image is first convolved with Sobel kernels [sobel19683x3] to generate edge responses for each orientation. The next step is to record the orientation that has the max response. Once the max response is recorded, a threshold is applied to ensure only “strong” edges are retained. The final step is to aggregate the max responses for each edge orientation to produce the final feature maps. The edge counts are binned into a histogram feature vector (similar to LBP).

The “no edge” case is when all edge responses are lower than the global threshold. When this condition is met, a value of “1” is assigned to the “no edge” or isotropic orientation as shown in Equation 7:

$$\mathcal{B}(R_{ijk+1}) = \begin{cases} 1, & \forall k = 1, \ldots, K;\ R_{ijk} < \theta_G \\ 0, & \text{otherwise} \end{cases} \tag{7}$$
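
The voting scheme in Equations 5 to 7 can be sketched as below. The function name `ehd_histogram`, the nested-list data layout, and the toy responses are assumptions for illustration; ties between orientations each receive a vote, following the “otherwise” case of Equation 6 as written.

```python
def ehd_histogram(responses, threshold):
    """Hard EHD voting over a map of edge responses.

    `responses[i][j]` holds the K edge responses at pixel (i, j). The
    returned list has K + 1 bins; the last bin is the "no edge" count.
    """
    num_orient = len(responses[0][0])
    hist = [0] * (num_orient + 1)
    for row in responses:
        for pixel in row:
            strongest = max(pixel)
            if strongest < threshold:
                hist[num_orient] += 1      # Equation 7: all responses are weak
                continue
            for k, value in enumerate(pixel):
                if value == strongest:     # no other orientation is larger
                    hist[k] += 1           # value >= threshold holds here
    return hist


# Hypothetical 2x2 image with K = 4 edge responses per pixel.
responses = [[[0.9, 0.1, 0.2, 0.0], [0.1, 0.05, 0.0, 0.2]],
             [[0.3, 0.8, 0.1, 0.0], [0.7, 0.2, 0.1, 0.6]]]
hist = ehd_histogram(responses, threshold=0.5)
```

In this toy input, two pixels vote for the first orientation, one for the second, and one falls into the “no edge” bin.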

An example of EHD features is shown in Figure 4. “Engineered” histogram-based features such as LBP and EHD have the same limitations as traditional feature engineering approaches. These limitations include the need for manual parameter tuning and domain expertise [nanni2017handcrafted]. To overcome these issues, others have proposed to 1) combine traditional and deep learning feature extraction approaches and 2) design neural networks to extract “engineered” features through network design.

(a) NEHD Implementation
(b) NLBP Implementation
Figure 5: The proposed neural “engineered” feature pipeline for NEHD (Figure 5(a)) and NLBP (Figure 5(b)). The structural texture information is captured in the blue box and the statistical texture information is shown in the green box. In Figure 5(a), the input image is convolved with kernels to capture each edge orientation. The “no edge” operation can be implemented using a threshold or convolution layer. In Figure 5(b), the input image is convolved with kernels to capture the difference between neighboring pixels and the center pixel. The “threshold” operation is implemented using an activation function (e.g., sigmoid). After the “threshold” function, a $1 \times 1$ convolution is used to compute a weighted sum operation of the difference maps to generate the “bit” map of LBP highlighted in orange. After the structural texture information is extracted, both features use a histogram layer to aggregate the structural texture information into statistical texture features.

II-B Combination of Traditional and Deep Learning Approaches

Traditional feature extraction approaches and deep learning methods are used 1) separately and 2) together. Generally, there are five approaches for combining traditional and deep learning approaches [peeples2022connecting]: 1) take features from a deep learning model and pass them into traditional classifiers [scabini2019evaluating, sani2017learning, zhang2016svm], 2) “engineer” the filters of a CNN (these may be updated through backpropagation or kept fixed) or emulate “engineered” features via the network design [bruna2013invariant, chan2015pcanet, malof2018improving, bianconi2019cnn, Su_2021_ICCV], 3) pass “engineered” features into the network as input [muhammad2017tex, anwer2018binary, van2019feeding], 4) combine both “engineered” and CNN features for traditional classifiers [nguyen2018combining, paul2016combining, wu2016multi], and 5) use texture encoding methods [cimpoi2015deep, zhang2017deep, song2017locally, xue2018deep, hu2019multi].

There are several problems with existing approaches for combining traditional and deep learning methods. First, an issue with incorporating “engineered” features is that the models cannot be trained in an end-to-end fashion because the “engineered” features are not updated as performance changes. Also, there are additional computational costs for training deep learning model(s) and separate classifier(s). Along with the increased computational costs, more parameter tuning is necessary for the “engineered” features, deep learning model(s), and classifier(s). Lastly, deep learning features perform well in practice, but these features are not as easily explainable and/or interpretable as traditional features, though there are ongoing efforts to “open” the black box [rudin2022interpretable].

III Method

III-A Neural “Engineered” Features Histogram Layer

The baseline histogram layer function in [peeples2021histogram] equally weights input features. As constructed, this implementation of the histogram layer will not be able to account for structural changes in texture information. For example, two binary images of a cross and checkerboard pattern will have the same distribution of pixels, but the arrangement of these pixels is different [peeples2022histogram]. The unweighted histogram layer will misidentify these two different structural texture types. To account for this, one can simply learn a weighting of the input features to account for structural differences. Our proposed neural “engineered” feature layer, $f$, takes an input image or feature map(s), $\mathbf{X}$, and applies two functions to extract texture information as shown in Equation 8:

$$f(\mathbf{X}) = \phi\left(\sum_{\rho \in \mathcal{N}} \psi(\mathbf{x}_{\rho})\right) \tag{8}$$

where $\phi$ and $\psi$ represent the statistical and structural texture information, respectively. Generally, $\psi$ is implemented as a local feature extractor in a given neighborhood $\mathcal{N}$ (e.g., $3 \times 3$ edge kernel) and $\phi$ is selected as a global or local operation.
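
The cross and checkerboard example motivating the weighting can be made concrete with a toy sketch. The $3 \times 3$ patterns, the plus-shaped kernel, and the helper names below are hypothetical and chosen only to illustrate the point.

```python
cross = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]
checkerboard = [[1, 0, 1],
                [0, 1, 0],
                [1, 0, 1]]


def pixel_counts(image):
    """Unweighted histogram of a binary image: [number of 0s, number of 1s]."""
    flat = [v for row in image for v in row]
    return [flat.count(0), flat.count(1)]


def weighted_sum(image, kernel):
    """Structural term: weighted sum of the pixels in the neighborhood."""
    return sum(w * v for k_row, i_row in zip(kernel, image)
               for w, v in zip(k_row, i_row))


# Hypothetical kernel that favors the plus-shaped arrangement.
plus_kernel = [[-1, 1, -1],
               [1, 1, 1],
               [-1, 1, -1]]

same_counts = pixel_counts(cross) == pixel_counts(checkerboard)
cross_response = weighted_sum(cross, plus_kernel)
checker_response = weighted_sum(checkerboard, plus_kernel)
```

Both patterns contain four zeros and five ones, so an unweighted histogram cannot separate them, whereas the weighted sum yields distinct responses that a subsequent binning step can place in different bins.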

$\phi$ represents the histogram layer introduced in [peeples2021histogram]. The histogram layer output is shown in Equation 9. The normalized frequency count, $\phi_{rcbk}$, is computed with a sliding window of size $S \times T$ and the binning operation for a histogram value in the $k^{th}$ channel of the input $x$:

$$\phi_{rcbk} = \cfrac{1}{ST} \sum_{s=1}^{S} \sum_{t=1}^{T} \exp\left(-\gamma_{bk}^{2} \left(\sum_{\rho \in \mathcal{N}} \psi_{bk}(\mathbf{x}_{\rho}) - \mu_{bk}\right)^{2}\right) \tag{9}$$

where $r$ and $c$ are spatial dimensions of the histogram feature maps. The histogram layer is used to aggregate the structural texture information from the input. We detail the structural texture feature extraction process for NEHD and NLBP in the next two sections.
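
The soft binning in Equation 9 can be sketched for a single window and a single channel as follows. This simplification assumes that the structural features have already been computed and are shared across bins; the function names and toy values are hypothetical.

```python
import math


def soft_bin(value, center, width):
    """RBF binning from Equation 9: exp(-width^2 * (value - center)^2)."""
    return math.exp(-(width ** 2) * (value - center) ** 2)


def histogram_layer_window(features, centers, widths):
    """Normalized soft counts over one S x T window of structural features.

    `features` is an S x T list of scalar values (the output of psi);
    `centers` and `widths` play the roles of mu and gamma for each bin.
    """
    total = sum(len(row) for row in features)
    counts = []
    for mu, gamma in zip(centers, widths):
        votes = sum(soft_bin(v, mu, gamma) for row in features for v in row)
        counts.append(votes / total)
    return counts


window = [[0.0, 0.0],
          [1.0, 1.0]]
counts = histogram_layer_window(window, centers=[0.0, 1.0], widths=[3.75, 3.75])
```

With a width of $3.75$, a value one unit away from a bin center contributes almost nothing, so each bin receives close to half of the votes in this toy window. Smaller widths would increase the overlap between bins.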

III-B Neural Edge Histogram Descriptor

NEHD captures structural texture information by accounting for edge information. $\psi^{NEHD}_{bk}$ is defined in Equation 10, where the input, $x$, is weighted by $w_{mnk}$, a value in an $M \times N$ kernel for each input channel:

$$\psi^{NEHD}_{bk} = \sum_{m=1}^{M} \sum_{n=1}^{N} w_{mnk}\, x_{r+s+m,\,c+t+n,\,k} \tag{10}$$

Equation 10 corresponds to the edge responses, $R$, discussed in Equation 5. Once the structural texture is captured, the output from $\psi^{NEHD}$ is then passed into the histogram layer. Unlike the baseline EHD feature, the NEHD can be trained end-to-end to update both the structural and statistical texture representation.

The NEHD edge responses can be easily implemented using a convolutional layer to capture the edge feature maps as shown in Figure 5(a). The histogram layer is also implemented using two convolutional layers for the binning of the edge responses. For the “no-edge” orientation, two approaches can be used: thresholding and convolution. Similar to EHD, after the edge responses are computed from Equation 10, a threshold can be applied to detect if there is a “strong” edge present. The “no-edge” map could then be simply concatenated to the edge feature maps from the convolutional layer. The second approach is to learn the threshold operation by using a $1 \times 1$ convolution to map the edge response maps to a single channel. A non-linearity (e.g., sigmoid) is then applied to the single feature map to learn a differentiable threshold operation. We investigate both approaches in Section V.
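
The two “no-edge” options can be contrasted at a single pixel as sketched below. The function names, the weights, and the bias are hypothetical; in the learned variant they would be obtained through training.

```python
import math


def no_edge_threshold(edge_responses, threshold):
    """Hard "no edge" indicator: 1 if every edge response is below the threshold."""
    return 1.0 if all(r < threshold for r in edge_responses) else 0.0


def no_edge_learned(edge_responses, weights, bias):
    """Soft "no edge" score: 1x1 convolution across channels, then a sigmoid.

    At a single pixel, a 1x1 convolution reduces to a weighted sum of the K
    edge responses plus a bias. The weights and bias would be learned; the
    values used below are hypothetical.
    """
    z = sum(w * r for w, r in zip(weights, edge_responses)) + bias
    return 1.0 / (1.0 + math.exp(-z))


weak = [0.05, 0.10, 0.02, 0.08]
strong = [0.90, 0.10, 0.02, 0.08]
weights, bias = [-10.0] * 4, 5.0

weak_hard = no_edge_threshold(weak, threshold=0.5)
strong_hard = no_edge_threshold(strong, threshold=0.5)
weak_soft = no_edge_learned(weak, weights, bias)
strong_soft = no_edge_learned(strong, weights, bias)
```

With negative weights and a positive bias, the sigmoid output is high when all responses are weak and low when any response is strong, which approximates the hard threshold while remaining differentiable.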

III-C Neural Local Binary Pattern

NLBP captures structural texture information by accounting for pixel differences. Similar to the LBCNN approach [liu2017local], $\psi^{NLBP}_{bk}$ is defined in Equation 11, where the input, $x$, is weighted by $w_{mnk}$, a value in an $M \times N$ kernel for each input channel:

$$\psi^{NLBP}_{bk} = \sum_{z=1}^{Z} \sigma\left(\sum_{m=1}^{M} \sum_{n=1}^{N} w_{mnk}\, x_{r+s+m,\,c+t+n,\,k}\right) \mathcal{V}_{kz} \tag{11}$$

where $\sigma$ is an activation function (e.g., sigmoid, ReLU) and $\mathcal{V}_{kz}$ is a learnable $1 \times 1$ convolution. Equation 11 parallels Equations 1 and 2. The difference between a neighbor pixel and center pixel can be implemented using sparse convolution kernels where the weight on the center pixel is $-1$, a neighbor pixel is $1$, and all other values in the kernel are $0$ as shown in Figure 5(b). This operation can be extended to other neighbors by rotating the kernel to process all pixels in the defined neighborhood. After the difference is computed, instead of the threshold operation, an activation function such as sigmoid can be used to map the values between $0$ and $1$. The final step is to multiply the “thresholded” difference maps by the binary base. This can also be implemented using the $1 \times 1$ convolution where the base value can be learned and not fixed to be a power of $2$.
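
A sketch of the structural term in Equation 11 at one spatial location is given below, assuming one sparse difference kernel per neighbor and a $1 \times 1$ convolution initialized to powers of $2$. The neighbor ordering and all helper names are assumptions made for illustration.

```python
import math

# Clockwise neighbor offsets in a 3x3 window (ordering is an assumption).
NEIGHBORS = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def difference_kernel(row, col):
    """Sparse 3x3 kernel: -1 on the center, +1 on one neighbor, 0 elsewhere."""
    kernel = [[0.0] * 3 for _ in range(3)]
    kernel[1][1] = -1.0
    kernel[row][col] = 1.0
    return kernel


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hard_threshold(diff):
    """Original LBP threshold, used here only for comparison."""
    return 1.0 if diff >= 0 else 0.0


def nlbp_structural(patch, base_weights, activation=sigmoid):
    """Activation of each neighbor difference followed by a weighted sum."""
    total = 0.0
    for (row, col), v in zip(NEIGHBORS, base_weights):
        kernel = difference_kernel(row, col)
        diff = sum(kernel[r][c] * patch[r][c] for r in range(3) for c in range(3))
        total += activation(diff) * v  # v acts as the 1x1 convolution weight
    return total


# Initializing the 1x1 convolution with powers of 2 mirrors the LBP binary base.
powers_of_two = [2.0 ** i for i in range(8)]
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
soft_code = nlbp_structural(patch, powers_of_two)
hard_code = nlbp_structural(patch, powers_of_two, activation=hard_threshold)
```

With the hard threshold, this sketch recovers the LBP code of the patch under the assumed ordering. With the sigmoid, it yields a continuous value in the same range, and the base weights can deviate from powers of $2$ once they are learned.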

IV Experimental Setup

Table I: FashionMNIST NEHD initialization and parameter learning results. The mean test accuracy with ±1 standard deviation across three experimental runs is shown. The baseline EHD approach had an average test accuracy of 86.94 ± 0.00. NEHD was initialized using a) EHD or b) randomly. The “no edge” bin was either a) learned using a convolution layer or b) computed by thresholding the edge response maps. The NEHD layer consisted of structural (convolution kernels) and statistical (histogram layer) texture features. The impact of updating both texture representations was captured across the columns of the table. The model with the best average test accuracy is bolded.

| Random Initialization | No Edge Threshold | Learn Both | Learn Structural | Learn Statistical | Fix Both |
| --- | --- | --- | --- | --- | --- |
|  |  | **89.74 ± 0.12** | 89.09 ± 0.24 | 88.81 ± 0.05 | 83.82 ± 0.26 |
|  |  | 89.51 ± 0.28 | 89.28 ± 0.24 | 88.35 ± 0.02 | 83.36 ± 0.06 |
|  |  | 89.13 ± 0.23 | 88.87 ± 0.27 | 88.50 ± 0.17 | 78.08 ± 3.14 |
|  |  | 89.16 ± 0.23 | 88.52 ± 0.22 | 88.28 ± 0.30 | 76.91 ± 2.54 |

The experiments are divided into two main parts: an ablation study and a feature comparison. All experiments used a shallow model that consisted of the input image, a feature extractor (EHD, LBP, NEHD, or NLBP), and a fully connected layer. We chose this simple model to focus on the discriminative power of each feature using a linear classifier. In the ablation study, the FashionMNIST [xiao2017fashion] dataset is used to investigate the impact of a) initialization and b) parameter learning in Section V-A. For the initialization experiments, the impact of initializing the layer randomly or with the baseline feature setting (i.e., NEHD kernels initialized with Sobel kernels and threshold operation) was evaluated. The parameter learning experiments focused on the contribution of learning a) statistical texture features, b) structural texture features, and c) both. Additionally, texture features such as LBP and EHD have typically been applied to grayscale or single channel inputs [porebski2008haralick]. The Plant Root Minirhizotron Imagery (PRMI) dataset [xu2022prmi] consists of RGB plant images. For both NEHD and NLBP, different multichannel approaches were investigated in Section V-B: treating each channel independently, using a $1\times1$ convolution to map an RGB image to a single channel input, and converting an RGB image to grayscale.

An extensive comparison is performed across three datasets (FashionMNIST [xiao2017fashion], PRMI [xu2022prmi], BloodMNIST [medmnistv1, medmnistv2, bloodmnist]) with the baseline EHD and LBP, LBP variants, NEHD, and NLBP as discussed in Section V-C. EHD was implemented using PyTorch. To compare our proposed method with the original LBP and LBP variants, the delayed function from the Dask package [rocklin2015dask] was used to parallelize the Sklearn LBP feature extraction for each mini-batch of images. Each mini-batch is passed through a loop that queues each image for computation while holding it in memory. Dask then computes all queued objects at the same time and returns them as a new list, which is converted to a tensor of the appropriate shape. This made it faster to compute the LBP of a mini-batch of images and allowed the features to be integrated into the PyTorch framework.
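The per-image parallelism described above can be illustrated with a small stand-in. The sketch below is an assumption-laden substitute rather than the pipeline used in the paper: it swaps Dask's delayed function for Python's standard `concurrent.futures` thread pool, replaces the library LBP with a tiny NumPy version (radius 1, eight neighbors), and summarizes each image as a normalized 256-bin histogram. The names `lbp_histogram` and `batch_lbp` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def lbp_histogram(image, bins=256):
    """8-neighbor LBP codes for the interior pixels, summarized as a normalized histogram."""
    center = image[1:-1, 1:-1]
    h, w = center.shape
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    codes = np.zeros((h, w), dtype=np.int64)
    for bit, (r, c) in enumerate(offsets):
        neighbor = image[r:r + h, c:c + w]
        codes += (neighbor >= center).astype(np.int64) << bit
    return np.bincount(codes.ravel(), minlength=bins) / codes.size


def batch_lbp(batch, workers=4):
    """Submit one task per image, gather the results in order, and stack them into one array."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        features = list(pool.map(lbp_histogram, batch))
    return np.stack(features)


rng = np.random.default_rng(0)
batch = rng.random((4, 8, 8))   # toy mini-batch of four 8x8 single-channel images
features = batch_lbp(batch)     # shape (4, 256), ready to convert to a tensor
```

The stacked array plays the role of the list that is converted to a tensor of the appropriate shape; in a PyTorch pipeline it could be wrapped with `torch.from_numpy` before the fully connected layer.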

BloodMNIST and PRMI have dedicated training, validation, and test partitions. For FashionMNIST, the training data was divided into a 90/10 train and validation split. The trained model was then applied to the holdout test set. A subset of PRMI (following [peeples2022divergence]) was chosen using the following four classes: cotton, sunflower, papaya, and switchgrass. No data augmentation was used for FashionMNIST. For BloodMNIST and PRMI, the data augmentation procedure followed [peeples2021histogram, xue2018deep]. The images were resized to $128\times128$, with random and center crops of $112\times112$ used as input to the models.

The experimental parameters were the following: 100 epochs, batch size of 64, Adam optimization, and an initial learning rate of $0.001$ for EHD, NEHD, and NLBP. The initial learning rate for the LBP baseline and variants was $0.01$. The EHD and NEHD parameters were set to the following: $3\times3$ edge kernel, $5\times5$ window size to aggregate bin counts, threshold of $0.9$, normalization of count (i.e., average pooling), and normalized kernel values. Both EHD and NEHD were set to extract eight edge orientations and a “no edge” bin, resulting in a $9$-bin histogram.
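To make the EHD settings concrete, the following NumPy sketch computes a 9-bin local edge histogram with $3\times3$ kernels, a threshold of $0.9$, and $5\times5$ average pooling. Several details are assumptions made only for illustration: the eight orientations are produced by rotating the outer ring of the horizontal Sobel kernel in 45-degree steps, the pooling stride is 1, no padding is used, and the kernel normalization mentioned above is omitted because its exact form is not given here.

```python
import numpy as np

# Clockwise ring of the eight outer positions in a 3x3 kernel.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
# Values of the horizontal Sobel kernel read along RING.
SOBEL_RING = [-1.0, 0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0]


def edge_kernels():
    """Eight 3x3 kernels obtained by shifting the Sobel values around the ring (an assumption)."""
    kernels = np.zeros((8, 3, 3))
    for k in range(8):
        for i, (r, c) in enumerate(RING):
            kernels[k, r, c] = SOBEL_RING[(i - k) % 8]
    return kernels


def correlate_valid(image, kernel):
    """Sliding-window weighted sum (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for m in range(kh):
        for n in range(kw):
            out += kernel[m, n] * image[m:m + h, n:n + w]
    return out


def ehd(image, threshold=0.9, window=5):
    """Local 9-bin edge histogram: eight orientation bins plus a final "no edge" bin."""
    responses = np.stack([correlate_valid(image, k) for k in edge_kernels()])
    strongest = responses.argmax(axis=0)
    is_edge = responses.max(axis=0) >= threshold
    votes = np.zeros((9,) + strongest.shape)
    for k in range(8):
        votes[k] = (strongest == k) & is_edge
    votes[8] = ~is_edge
    # Average pooling turns the per-pixel votes into normalized local bin counts.
    box = np.full((window, window), 1.0 / window ** 2)
    return np.stack([correlate_valid(v, box) for v in votes])


image = np.zeros((12, 12))
image[:, 6:] = 1.0           # toy image with a single vertical step edge
histogram = ehd(image)       # shape (9, 6, 6)
```

Because every pixel casts exactly one vote, the nine bins sum to one at each spatial location. For the toy step edge, only the first orientation bin and the “no edge” bin receive votes.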

The baseline LBP and variants used a radius of 1 and a neighborhood size ($P$) of $8$. Each LBP approach used the default number of bins: baseline LBP (256), uniform LBP (59), nri-uniform LBP (59), ror LBP (256), and var LBP (256). NLBP used the following settings: $3\times3$ pixel difference kernel, $5\times5$ window size to aggregate bin counts, ReLU activation function, normalization of count (i.e., average pooling), and a 16-bin histogram. The ReLU activation function was used in NLBP to promote better learning [liu2017local]. Because the input data ranged between $0$ and $1$, the pixel differences ranged between $-1$ and $1$, and the ReLU function mapped them to values between $0$ and $1$. A total of five experimental runs were completed for each model on the PRMI dataset while three experimental runs were used for BloodMNIST and FashionMNIST for each configuration. Experiments were completed on an A100 GPU.
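One way to arrive at the 59-bin count for $P=8$ is to enumerate the 8-bit codes directly. The sketch below assumes the usual definition of a uniform pattern (at most two 0/1 transitions when the bits are read circularly) and gives all non-uniform codes one shared bin; the function name is hypothetical.

```python
def circular_transitions(code, bits=8):
    """Number of 0/1 changes when the bits of `code` are read around a circle."""
    pattern = [(code >> i) & 1 for i in range(bits)]
    return sum(pattern[i] != pattern[(i + 1) % bits] for i in range(bits))


# Codes with at most two circular transitions are "uniform".
uniform_codes = [code for code in range(256) if circular_transitions(code) <= 2]

# Each uniform code keeps its own bin; all remaining codes share one bin.
n_bins = len(uniform_codes) + 1
```

The enumeration finds 58 uniform codes, which matches the closed form $P(P-1)+2$ for $P=8$, so the histogram has 59 bins once the shared bin is added.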

V Results and Discussion

V-A Ablation Study: Initialization and Parameter Learning

Table II: FashionMNIST NLBP initialization and parameter learning results. The mean test accuracy with ±1 standard deviation across three experimental runs is shown. The baseline LBP approach had an average test accuracy of 71.66 ± 0.02. NLBP was initialized using a) LBP or b) randomly. The NLBP base power was either a) fixed (power of 2) or b) learned. The NLBP layer consisted of structural (convolution kernels) and statistical (histogram layer) texture features. The impact of updating both texture representations was captured across the columns of the table. The model with the best average test accuracy is bolded.

| Random Initialization | Fixed Base | Learn Both | Learn Structural | Learn Statistical | Fix Both |
| --- | --- | --- | --- | --- | --- |
|  |  | 85.54 ± 0.07 | 85.70 ± 0.01 | 86.63 ± 0.02 | 79.28 ± 0.00 |
|  |  | 85.52 ± 0.04 | 85.56 ± 0.01 | 86.68 ± 0.02 | 78.65 ± 0.00 |
|  |  | 87.44 ± 0.40 | 86.67 ± 0.50 | 85.20 ± 0.03 | 81.84 ± 0.51 |
|  |  | **87.50 ± 0.30** | 81.53 ± 0.42 | 80.74 ± 0.48 | 36.05 ± 1.60 |

V-A1 NEHD

The results of the NEHD initialization and parameter learning are shown in Table I. The proposed NEHD layer was robust to random or EHD initialization, as shown by the overlapping error bars for each test accuracy, except when the structural and statistical textures were fixed. When the layer was fixed, the EHD initialization performed statistically significantly better than random initialization. Another interesting observation is that for most learnable texture features, the EHD initialization led to marginally better performance. Edges are important features that distinguish the classes in FashionMNIST as there is no background in the images and only the articles of clothing are present. Therefore, the edge “profiles” for each clothing item can be a powerful discriminator between the different classes. This analysis is further validated as the baseline EHD feature achieved a higher average accuracy ($86.94$) than all of the models with fixed statistical and structural texture features, with either random or EHD initialization. The EHD initialization for NEHD is not exactly equal to EHD due to the soft binning approximation of the histogram layer.
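The gap between hard and soft binning can be seen in a toy example. The sketch assumes the radial basis function form of soft binning used by histogram layers [peeples2021histogram], where a value votes for a bin with weight $\exp(-\gamma^{2}(x-\mu)^{2})$ for bin center $\mu$ and width $\gamma$; the specific values, centers, and widths below are made up for illustration.

```python
import numpy as np


def soft_bin_votes(values, centers, widths):
    """RBF vote of each value for each bin: exp(-width^2 * (value - center)^2)."""
    gaps = values[None, :] - centers[:, None]
    return np.exp(-(widths[:, None] ** 2) * gaps ** 2)


def hard_bin_votes(values, centers):
    """One-hot vote for the nearest bin center, as in a standard histogram."""
    nearest = np.abs(values[None, :] - centers[:, None]).argmin(axis=0)
    return np.eye(len(centers))[:, nearest]


values = np.array([0.0, 0.4, 1.0])   # toy responses
centers = np.array([0.0, 1.0])       # two bins
widths = np.array([2.0, 2.0])

soft = soft_bin_votes(values, centers, widths)   # rows are bins, columns are values
hard = hard_bin_votes(values, centers)
```

A value that sits exactly on a bin center gets a full vote in both schemes, but the value 0.4 contributes partially to both bins under soft binning and entirely to the first bin under hard binning. This is why a layer initialized from EHD does not reproduce the EHD histogram exactly.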

The most difficult classes to distinguish were the “T-shirt/top” and “shirt.” The baseline EHD had difficulty differentiating between these two classes as the representation is fixed and cannot adapt. However, our proposed NEHD layer outperformed the baseline EHD feature for learnable settings: learn structural, statistical, or both texture features. For the structural texture features, the edge kernel can be adapted from the initial Sobel kernels to account for more details regarding the structure of each clothing item. The statistical features assist with the soft binning approximation to mitigate small intra-class variations for each clothing item. As noted in Table I, the combination of learning both texture feature representations achieved the best performance on the FashionMNIST test set.

V-A2 NLBP

The results of the ablation study on NLBP are shown in Table II. The NLBP feature was robust to either random or LBP initialization except for the case when the kernel was randomly initialized and the base was fixed to be a power of $2$. This result matches similar analysis from [liu2017local], which shows that fixing the base power may limit the generalization ability of the feature. Another interesting observation is that the randomly initialized models performed slightly better for every setting except when learning statistical features and when fixing both texture feature parameters with a fixed base. For the statistical features, initializing the histogram layer to meaningful bin centers and widths assists the learning process of the model. For the structural textures, randomly initializing the kernels led to an improvement in performance except when the base power was fixed. This result indicates that the initialization of the kernel used to capture the relationship between neighboring pixels is important when capturing the structural texture information. The LBP initialization gives the NLBP feature sparse kernels, and this may limit the learning ability of the model.

Similar to NEHD, maximal performance is achieved when both structural and statistical texture features are jointly learned. When LBP initialization is used, updating the statistical features achieved higher performance than only updating the structural features. This result is intuitive as a limitation of the original LBP is that the histogram can be sparse if some of the 256 LBP codes are not well represented. With the NLBP approach, the aggregation of the LBP encoding or bit map can be adjusted by updating the bin centers and widths to maximize the performance on different tasks such as image classification. For NLBP, all learning settings (i.e., updating structural and/or statistical textures) statistically significantly outperformed both the baseline LBP feature ($71.66$ average test accuracy) and the setting with all parameters fixed, demonstrating the utility of the proposed neural “engineered” feature.

V-B Ablation Study: Multichannel processing

Table III: Multichannel processing approaches for NEHD and NLBP on the PRMI dataset. The mean test accuracy with ±1 standard deviation across five experimental runs is shown. The independent setting applied the neural “engineered” feature to each channel separately, the $1\times1$ convolution was a learnable mapping of the three channel input to a single channel, and grayscale converted the image to a single channel intensity image. The NLBP and NEHD fusion approach with the best average test accuracy is bolded for each feature.

| Model | Independent | $1\times1$ Convolution | Grayscale |
| --- | --- | --- | --- |
| NEHD | **89.92 ± 0.21** | 88.18 ± 5.60 | 89.45 ± 1.56 |
| NLBP | **91.17 ± 1.06** | 91.08 ± 3.00 | 88.76 ± 6.44 |

EHD and LBP are generally applied to grayscale or single channel inputs (though efforts have been made to extend these features to multichannel inputs [humeau2022color]). For NEHD and NLBP, we explored three approaches for applying these neural “engineered” features to RGB images such as those in the PRMI dataset [xu2022prmi]. One approach was to treat each channel independently and extract the neural “engineered” features similar to [peeples2021histogram]. For example, with NEHD, the feature can be applied to the red, green, and blue channels separately. Each input channel would have $D$ edge orientation histogram feature maps (no edge included). The resulting feature maps would be concatenated together before the next layer in an artificial neural network. The two other approaches, a) $1\times1$ convolution and b) grayscale conversion, take the RGB image and convert it to a single channel. The grayscale conversion only works for an RGB input image, but the $1\times1$ convolution approach generalizes to input feature maps of varying channel dimensions.
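The three multichannel strategies differ mainly in how many feature maps they hand to the next layer, which a short NumPy sketch can make explicit. Everything here is illustrative: `toy_feature` is a hypothetical stand-in for NEHD or NLBP that returns $D=2$ maps per channel, the `learned` weights are placeholders for parameters that would be trained, and the grayscale weights are the common ITU-R BT.601 luma coefficients, which the paper does not specify.

```python
import numpy as np


def toy_feature(channel):
    """Hypothetical stand-in for NEHD/NLBP: maps one (H, W) channel to D = 2 feature maps."""
    return np.stack([channel, channel ** 2])


def independent(image, feature_fn):
    """Apply the feature to each channel separately and concatenate the resulting maps."""
    return np.concatenate([feature_fn(channel) for channel in image], axis=0)


def pointwise_conv(image, weights, feature_fn):
    """1x1 convolution: a weighted sum over channels at every pixel, then the feature."""
    single_channel = np.tensordot(weights, image, axes=1)
    return feature_fn(single_channel)


def grayscale(image, feature_fn):
    """Fixed RGB-to-intensity conversion (BT.601 luma weights assumed), then the feature."""
    luma = np.array([0.299, 0.587, 0.114])
    return pointwise_conv(image, luma, feature_fn)


rng = np.random.default_rng(0)
rgb = rng.random((3, 8, 8))            # toy RGB image in channel-first layout
learned = np.array([0.2, 0.5, 0.3])    # placeholder for trainable 1x1 convolution weights

shapes = {
    "independent": independent(rgb, toy_feature).shape,
    "1x1 convolution": pointwise_conv(rgb, learned, toy_feature).shape,
    "grayscale": grayscale(rgb, toy_feature).shape,
}
```

The independent approach produces $3D$ feature maps for an RGB input, while the other two produce $D$. Only the $1\times1$ convolution accepts an arbitrary number of input channels, since its weight vector simply grows with the channel count.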

Table IV: Comparison results of our proposed neural “engineered” features and “engineered” features across different datasets. The mean test accuracy with ±1 standard deviation across multiple experimental runs is shown. The model with the highest test accuracy for each dataset is bolded.

| Method | FashionMNIST | PRMI | BloodMNIST |
| --- | --- | --- | --- |
| EHD | 86.94 ± 0.00 | 54.58 ± 6.62 | 66.33 ± 0.07 |
| LBP | 71.66 ± 0.02 | 69.68 ± 0.12 | 43.25 ± 0.07 |
| LBP ROR | 51.10 ± 0.12 | 70.70 ± 0.11 | 40.32 ± 0.07 |
| LBP Uniform | 28.26 ± 0.01 | 67.17 ± 0.00 | 33.41 ± 0.06 |
| LBP NRI Uniform | 51.07 ± 0.02 | 71.30 ± 0.03 | 33.92 ± 0.16 |
| LBP Var | 10.00 ± 0.00 | 59.93 ± 0.11 | 19.47 ± 0.00 |
| NEHD (ours) | **89.74 ± 0.12** | 89.92 ± 0.21 | **83.60 ± 0.35** |
| NLBP (ours) | 87.50 ± 0.30 | **91.17 ± 1.06** | 76.06 ± 1.47 |

Figure 6: TSNE results for the best EHD, NEHD, LBP, and NLBP on the PRMI test set, with panels (a) EHD (61.25%), (b) NEHD (90.17%), (c) LBP (71.33%), and (d) NLBP (92.50%). The test accuracy for each approach is shown in parentheses. The random seed for each approach was the same in order to have a fair qualitative comparison between the feature visualizations. Each color corresponds to a different class in the PRMI dataset. The NEHD and NLBP models appear to have more compact classes in the projected space. For example, the switchgrass class (red data points) appears more compact for NEHD than EHD. When comparing NLBP and LBP, NLBP results in a more compact sunflower class (green data points).

Figure 7: The confusion matrices for the best EHD, LBP (LBP NRI Uniform), NEHD, and NLBP are shown for the PRMI test set, with panels (a) EHD (54.58 ± 6.62%), (b) NEHD (89.92 ± 0.21%), (c) LBP (71.30 ± 0.03%), and (d) NLBP (91.17 ± 1.06%). The mean test accuracy with ±1 standard deviation is shown in parentheses. Both NEHD and NLBP improve the identification of each class in comparison to the “engineered” features.

The results of our multichannel processing for the best NLBP and NEHD from Section V-A are shown in Table III. Independently processing each channel achieved the highest average test accuracy. This result is intuitive because independent processing retains the most information from the input. However, the number of features grows as the number of input feature channels increases. The average test accuracy of the grayscale conversion and the $1\times1$ convolution is comparable to the independent processing approach, though both show increased variability, as indicated by their larger standard deviations. When integrating the neural “engineered” features into a deeper network, the $1\times1$ convolution may be the most suitable approach as it reduces the number of features extracted when compared to the independent approach. The grayscale conversion is limited to RGB inputs.

V-C Comparison of Neural and “Engineered” Features

The last set of experiments compared our proposed NEHD and NLBP with the baseline EHD and LBP variants across three different datasets: FashionMNIST [xiao2017fashion], PRMI [xu2022prmi], and BloodMNIST [medmnistv1, medmnistv2, bloodmnist]. The results of the comparisons are shown in Table IV. Overall, both NEHD and NLBP outperform the “engineered” features across all three datasets. For FashionMNIST, as discussed in Section V-A, edges are important to distinguish the different articles of clothing, resulting in EHD achieving the best performance among the “engineered” features. However, our NEHD and NLBP achieve statistically significantly (i.e., no overlapping error bars) improved performance. PRMI and BloodMNIST consist of RGB images. Each feature was applied to each channel independently for a fair comparison between the approaches. The PRMI results show that NLBP slightly outperformed NEHD. For the PRMI dataset, edge information is important to identify the roots, but some images have illumination differences. LBP is robust to illumination variations (e.g., monotonic changes) [humeau2019texture], resulting in NLBP retaining a similar property when extracting texture features from the images.
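The “no overlapping error bars” criterion can be written as a one-line check on the mean ± 1 standard deviation intervals. The helper name below is hypothetical, and the inputs are the FashionMNIST and PRMI entries reported in Table IV.

```python
def error_bars_overlap(mean_a, std_a, mean_b, std_b):
    """True if [mean_a - std_a, mean_a + std_a] and [mean_b - std_b, mean_b + std_b] intersect."""
    return abs(mean_a - mean_b) <= std_a + std_b


# FashionMNIST: NEHD (89.74 +/- 0.12) against the baseline EHD (86.94 +/- 0.00).
nehd_vs_ehd = error_bars_overlap(89.74, 0.12, 86.94, 0.00)

# FashionMNIST: NLBP (87.50 +/- 0.30) against the baseline EHD (86.94 +/- 0.00).
nlbp_vs_ehd = error_bars_overlap(87.50, 0.30, 86.94, 0.00)

# PRMI: NLBP (91.17 +/- 1.06) against NEHD (89.92 +/- 0.21).
nlbp_vs_nehd = error_bars_overlap(91.17, 1.06, 89.92, 0.21)
```

Under this criterion the first two comparisons show no overlap, in line with the significance claim for FashionMNIST, while the PRMI intervals for NLBP and NEHD do overlap, consistent with describing NLBP as only slightly ahead.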

The NEHD and NLBP are compared with the best EHD and LBP variant qualitatively (Figure 6) and quantitatively (Figure 7). As seen in the TSNE visuals in Figure 6, the features learned by NEHD and NLBP appear more compact in the 2D projection than the “engineered” features. An important aspect of both approaches is the aggregation performed by the histogram layer. The soft binning approach provides a way to account for intra-class variations that can be present within the dataset [peeples2021histogram]. As noted in the confusion matrices in Figure 7, the neural “engineered” features outperform the “engineered” features. For some of the PRMI images, the roots are occluded by background information (e.g., soil, image artifacts), and the fixed representation of the “engineered” features is limited in mitigating the impact of these confusers in the images. However, our neural “engineered” features can adapt the texture representation to improve classification performance.
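For readers who want to reproduce this kind of quantitative comparison, a confusion matrix and its overall accuracy can be computed as sketched below. The class names follow the four PRMI classes used in this work, but the label arrays are made-up toy values, not results from the paper, and the function names are hypothetical.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e., correctly classified."""
    return np.trace(matrix) / matrix.sum()


classes = ["cotton", "sunflower", "papaya", "switchgrass"]
true_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])        # toy ground truth
predicted_labels = np.array([0, 3, 1, 1, 2, 0, 3, 3])   # toy predictions

matrix = confusion_matrix(true_labels, predicted_labels, len(classes))
overall = accuracy(matrix)
```

Off-diagonal entries show which classes are confused with each other; in the toy data, one cotton sample is predicted as switchgrass and one papaya sample as cotton.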

BloodMNIST was the most difficult dataset for the shallow network since it has eight classes (more than PRMI) and the blood cells show only small visual differences between classes. However, our proposed features achieved statistically significantly higher test accuracy than the “engineered” features. Texture information is vital in several biomedical applications [liu2019bow]. Our proposed approach can be applied to several biomedical tasks and integrated into deeper networks to achieve improved performance on the BloodMNIST dataset.

VI Conclusion

In this work, we proposed neural handcrafted features by using histogram layers to aggregate structural texture features. We demonstrate the approach by introducing two neural handcrafted features: NEHD and NLBP. Our results across benchmark (FashionMNIST) and real-world (PRMI and BloodMNIST) datasets showcase the potential use of these features. The general framework for neural handcrafted features can be used for other texture feature approaches such as Haralick texture features [haralick1973textural], histogram of oriented gradients [dalal2005histograms] and several more statistical and structural texture feature extraction methods [liu2019bow, humeau2019texture]. Future work includes integrating the neural handcrafted layer(s) into deeper networks, improving the multichannel processing approach, and designing new objective functions to maximize statistical and/or structural texture information.

Acknowledgment

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1842473 and by the Office of Naval Research grant N00014-16-1-2323. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.

Joshua Peeples received the Ph.D. degree in electrical and computer engineering from the University of Florida in 2022. He is currently an Assistant Professor in the Department of Electrical and Computer Engineering at Texas A&M University. His primary research interests include machine learning, computer vision, and image processing with a focus on image texture analysis.
Salim Al Kharsa is an undergraduate in the Department of Electrical and Computer Engineering at Texas A&M University. He has an expected graduation date of May 2024.
Luke Saleh is an undergraduate Computer Engineering student at the University of Florida with an expected graduation date of December 2024. He is currently working as an undergraduate researcher for the Machine Learning and Sensing Lab at UF. His primary interests include embedded systems, machine learning, digital hardware design, and data science.
Alina Zare (Senior Member, IEEE) teaches and conducts research in the area of machine learning and artificial intelligence as a Professor in the Electrical and Computer Engineering Department at the University of Florida. She also serves as the Associate Dean for Research and Facilities at the Herbert Wertheim College of Engineering. Dr. Zare’s research has focused primarily on developing new machine learning algorithms to automatically understand and process data and imagery. Her research work has included automated plant root phenotyping, sub-pixel hyperspectral image analysis, target detection and underwater scene understanding using synthetic aperture sonar, LIDAR data analysis, Ground Penetrating Radar analysis, and buried landmine and explosive hazard detection.