StarLKNet: Star Mixup with Large Kernel Networks
for Palm Vein Identification

Xin ** (ORCID 0009-0005-0983-6853; equal contribution), Hongyu Zhu (ORCID 0009-0000-5993-4666; equal contribution), Mounîm A. El Yacoubi, Hongchao Liao, Hufeng Qin, Yun Jiang
Corresponding author email: [email protected]
Chongqing Technology and Business University, China; Telecom SudParis, Institut Polytechnique de Paris, France

Abstract

As a representative of a new generation of biometrics, vein identification technology offers a high level of security and convenience. Convolutional neural networks (CNNs), a prominent class of deep learning architectures, have been extensively utilized for vein identification. However, because their performance and robustness are limited by small Effective Receptive Fields (e.g. 3 $\times$ 3 kernels) and insufficient training samples, they cannot effectively extract global feature representations from vein images. To address these issues, we propose StarLKNet, a large kernel convolution-based palm-vein identification network combined with a Mixup approach. Our StarMix effectively learns the distribution of vein features to expand the set of training samples. To enable CNNs to capture comprehensive feature representations from palm-vein images, we explore the effect of convolutional kernel size on the performance of palm-vein identification networks and design LaKNet, a network leveraging large kernel convolution and a gating mechanism. To the best of our knowledge, this is the first deployment of a CNN with large kernels in the domain of vein identification. Extensive experiments were conducted to validate the performance of StarLKNet on two public palm-vein datasets. The results demonstrate that StarMix provides superior augmentation and that LaKNet exhibits more stable performance gains than mainstream approaches, resulting in the highest recognition accuracy and lowest identification error.


1 Introduction

The issue of personal information security has received increasing attention in modern society, as misidentification may have a catastrophic impact on personal property security and privacy. Token-based authentication methods such as passwords and ID cards are at risk of being forgotten or stolen. In recent decades, a great deal of research has been conducted on biometric technology, which identifies individuals through their physiological (e.g. face[18], fingerprint[2] and vein[38, 39]) or behavioral (e.g. gait[3] and eye movement[30]) characteristics. The most common biometric features used in applications are faces and fingerprints. However, these external features may be subject to forgery attacks[23]. In contrast, vein recognition has significant advantages. Veins and blood vessels are located inside the human body and are not easily affected by the external environment (e.g. skin moisture and wear). Furthermore, because deoxygenated hemoglobin only exists in the living body, vein recognition technology has inherent liveness detection[28].

Figure 1: *Left:* Top-1 accuracy ( $\uparrow$ ) using MixUp and StarMix in different models; *Right:* StarMix and ResNet18 fitting curves on the VERA220 dataset; StarMix fits faster and with higher classification accuracy.

In the field of vein recognition, traditional methods rely on hand-crafted feature extraction and shallow machine learning algorithms for classification[11, 37]. These methods assume that vein patterns have a valley or line-like shape, and therefore ignore valid information in higher dimensions. Methods based on deep learning can achieve end-to-end feature extraction without prior assumptions, but training the network parameters requires a large amount of data. Unfortunately, it is challenging to obtain a substantial number of samples from each class in practical applications due to limited storage and privacy policies. How to train a robust, high-performance network with limited data is therefore a pressing problem.

To address the overfitting issue arising from insufficient training data, researchers have proposed data augmentation (DA) techniques, which generate new data through either hand-crafted or generative means to expand the training set. Mixup uses global linear interpolation to mix two or more samples. It has been widely utilized and improved by researchers (e.g., CutMix[44], AutoMix[19], PuzzleMix[13], AdAutoMix[29]), thanks to its plug-and-play functionality and minimal additional time overhead. Since its inception, Mixup has demonstrated strong performance and robust generalizability in downstream tasks, including Super-resolution[42], Segmentation[26], Regression[41], and Long-tail[1].

Starting with VGG[33], researchers have gradually shifted their focus away from the use of large kernels in CNNs, opting instead for stacks of multiple smaller kernels, as overly large kernels result in significant time and memory overheads and tend to overlook localized, subtle features within images. Nevertheless, some recent studies[6, 16, 43, 7] have indicated that, with some subtle improvements, CNNs with a larger Effective Receptive Field can rival the performance of Vision Transformer (ViT)[8]. Due to the continuous and sparse distribution of vein image features, we posit that large convolutional kernel networks have a significant advantage over small ones in capturing the distribution of vein features. As shown in Figure 1, the large kernel network exhibits faster fitting and higher classification accuracy than the small kernel framework represented by ResNet18[9].

To address the challenges of feature extraction from vein images, we propose a Mixup method for vein images to augment the training data. Furthermore, we design a high-performance palm vein recognition network framework based on the stacked formation of large kernel convolutional modules, in order to extract a more comprehensive and robust feature representation of palm vein images. First, we propose StarMix, featuring a more suitable mask for vein image mixing, with the mixing parameter generated from a Gaussian function. In our scheme, the network is free to choose the mixing strategy as either StarMix or vanilla Mixup with a threshold. Furthermore, we propose LaKNet, a network with convolutional and gating modules. The convolutional module comprises large kernel convolution and small kernel convolution, employed for extracting global and local features. The gating module facilitates feature filtering by learning to control the flow of feature information. Experimental results indicate that the StarLKNet model exhibits superior performance compared to the ResNet18 model on the VERA220[34] dataset, with an improvement in the test set top-1 accuracy of +19.73% without augmentation.

In summary, our main contributions are as follows:

• We rethink the impact of convolutional kernel size on network performance in the vein recognition task, and find that for vein images with continuous and sparse feature distributions, increasing the Effective Receptive Field can significantly improve network performance.

• We propose StarLKNet, a novel network framework designed for vein recognition, that employs a large kernel convolutional module and a gating module to achieve comprehensive and robust feature extraction. Our evaluation on two large public palm vein datasets demonstrates that StarLKNet outperforms existing methods in terms of recognition accuracy and validation error.

• We propose StarMix, a data augmentation method that utilizes a Gaussian function to generate, for mixing, suitable masks for vein image feature distribution, thereby significantly enhancing classifier performance.

2 Related Work

Palm Vein Identification.

As hemoglobin absorbs infrared light, infrared cameras can acquire vein images. This acquisition method and the characteristics of vein distribution make feature extraction challenging. Research addressing this challenge can be broadly categorized into two types: traditional methods, in which handcrafted features are extracted and fed to shallow machine learning classifiers, and CNN-based feature extraction methods. 1) In the first category, Miura et al.[24], for instance, performed repeated line tracking to detect valley shapes in cross-sectional vein patterns and extract finger-vein texture for verification. Other works, assuming that vein patterns in a predefined neighborhood can be regarded as line segments, have proposed line detection methods to extract line-like textures, including Gabor-based[47] and wide line detectors[10]. Among the traditional shallow machine learning methods used in the vein domain, [4], for instance, employed principal component analysis (PCA) and linear discriminant analysis (LDA) for feature dimensionality reduction, combined with a support vector machine (SVM) for feature classification. 2) CNNs have demonstrated remarkable capability in extracting features in vein recognition tasks. Syafeeza et al., for instance, proposed a CNN with four layers for finger-vein identification[32], and later employed a pre-trained VGG16[33] model and an enhanced seven-layer CNN to identify finger-veins[17].

MixUp.

Mixing augmentation methods started with MixUp[45], which performs a static linear interpolation of two samples according to a mixing ratio between 0 and 1 to obtain a mixed sample. CutMix[44] moves mixing from the pixel level to the spatial level, generating a mask sized according to the mixing ratio to randomly mix patches. Subsequently, several methods were proposed to improve the sample mixing policies or label mixing policies[36, 31, 14, 20]. In contrast to hand-crafted methods, SaliencyMix[35] obtains saliency information through an additional feature extractor and uses it to guide the mixing of the samples. With a similar aim, PuzzleMix[13] and Co-Mix[12] use gradient information from backpropagation to locate feature regions and employ an optimal transport scheme to avoid overlapping feature information by feature maximization in the mixed samples. AutoMix[19] adopts an end-to-end approach by designing a generator mix block that optimizes both the generator and the model, achieving an optimal result in terms of time overhead and performance. More recently, AdAutoMix[29], built upon AutoMix, has been proposed to augment the generated samples by mixing any set of $N$ samples rather than only two; AdAutoMix also introduced adversarial training to prevent generator overfitting, by pushing the generator to produce difficult samples that have more impact on training performance.

3 Preliminaries

3.1 Mixup

We define $\mathbb{X}$ to be the set of training samples and $\mathbb{Y}$ the set of corresponding ground-truth labels. For each sample pair $(x,y)$, $x\in\mathbb{R}^{w,h,c}$ is the sample and $y\in\mathbb{R}^{C}$ is the corresponding one-hot label, where $w,h,c$ are the sample's width, height, and number of channels, respectively, and $C$ is the number of classes. We mix the sample pairs $(x_{i},y_{i})$, $(x_{j},y_{j})$ by linear interpolation according to MixUp to obtain mixed samples and labels:

	$\displaystyle\hat{x}=\lambda x_{i}+(1-\lambda)x_{j},$
	$\displaystyle\hat{y}=\lambda y_{i}+(1-\lambda)y_{j},$		(1)

where $\lambda$ is the mixing ratio drawn from the $Beta(\alpha,\alpha)$ distribution. We map the mixed sample $\hat{x}$ to its label $\hat{y}$ by a deep neural network $f_{\theta}$: $f_{\theta}(\hat{x})\mapsto\hat{y}$. The parameter vector $\theta$ of the network is trained by minimizing the loss function.
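As an illustration of Eq.1, the following is a minimal NumPy sketch of vanilla MixUp on a single pair of samples. The function name `mixup`, the toy array shapes, and the fixed random seed are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=1.0, rng=None):
    """Vanilla MixUp (Eq. 1): linear interpolation of samples and one-hot labels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing ratio drawn from Beta(alpha, alpha)
    x_hat = lam * x_i + (1.0 - lam) * x_j   # mixed sample
    y_hat = lam * y_i + (1.0 - lam) * y_j   # mixed (soft) label
    return x_hat, y_hat, lam

# Toy usage: two 4x4 single-channel "images" and 3-class one-hot labels.
x_a, x_b = np.zeros((4, 4, 1)), np.ones((4, 4, 1))
y_a, y_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
x_hat, y_hat, lam = mixup(x_a, y_a, x_b, y_b, rng=np.random.default_rng(0))
```

Because the same $\lambda$ weights both the samples and the labels, the mixed label remains a valid probability vector.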

3.2 Gating

The MogaNet gating mechanism in our network comprises $Conv_{1\times 1}$ and $SiLU$ activation function[15]. It provides a simple yet effective structure for filtering the flow of information during network training, thereby enhancing the efficacy of the extracted features. The objective of utilizing $SiLU$ is to integrate the smoothness and nonlinearity of the $Sigmoid$ function with the linear properties of the ReLU linear unit within the positive region:

	$\displaystyle SiLU(z)=z\cdot Sigmoid(z).$		(2)

Specifically, $Conv_{1\times 1}$ is employed to implement the linear transformation of features, whereby the weights are adjusted to emphasize or suppress the features. The output of $Conv_{1\times 1}$ , $z\in\mathbb{R}^{w,h,c}$ , is utilized as input to $SiLU$ , which is capable of adjusting its activation value according to the importance of the input features, thereby realizing the effect of the gating mechanism:

	$\displaystyle gating(z)=SiLU(Conv_{1\times 1}(z)).$		(3)
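Eqs.2 and 3 can be sketched in a few lines of NumPy. The sketch relies on the fact that a $1\times 1$ convolution is a per-pixel linear map over the channel axis; the helper names, the random weights, and the toy tensor size are assumptions for illustration only.

```python
import numpy as np

def silu(z):
    """SiLU(z) = z * Sigmoid(z) (Eq. 2)."""
    return z / (1.0 + np.exp(-z))

def conv1x1(z, weight, bias):
    """A 1x1 convolution acts as a per-pixel linear map over channels.
    z: (w, h, c_in), weight: (c_in, c_out), bias: (c_out,)."""
    return z @ weight + bias

def gating(z, weight, bias):
    """gating(z) = SiLU(Conv_1x1(z)) (Eq. 3)."""
    return silu(conv1x1(z, weight, bias))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 8, 4))     # toy feature map with 4 channels
weight = rng.standard_normal((4, 4))   # hypothetical weights, random for illustration
bias = np.zeros(4)
out = gating(z, weight, bias)
```

SiLU passes large positive activations almost unchanged and squashes negative ones towards zero, which is what lets the learned $Conv_{1\times 1}$ weights emphasize or suppress individual features.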

4 StarLKNet

Our proposed StarLKNet model, illustrated in Figure 2, consists of two components, StarMix and LaKNet. StarMix employs a mask generated by a Gaussian function to mix and augment the data, while LaKNet consists of a convolutional module with a large kernel and a gating module. In the next section, we first introduce our hybrid method StarMix, and then detail LaKNet.

Algorithm 1 StarMix pseudo-code process.

Require: $Beta(\alpha,\alpha)$ distribution, training samples and labels $\mathbb{X}$, $\mathbb{Y}$ with $x\in\mathbb{R}^{w,h,c}$, mixing ratio $\lambda$, threshold [0.3, 0.7], Gaussian function $Ga(\cdot)$, and kernel size $k$.

for $(x_{i}, y_{i})$ in the $(\mathbb{X}, \mathbb{Y})$ loader do
    $\lambda\sim Beta(\alpha,\alpha)$
    $(x_{j}, y_{j})$ = random permutation (torch.randperm) of $(x_{i}, y_{i})$
    if $0.3\leq\lambda\leq 0.7$ then
        $G\leftarrow Ga(k,h,\sigma)$
        compute $\hat{\lambda}$ according to Eq.7
        $\hat{x}=G*x_{i}+(1-G)*x_{j}$, $\hat{y}=\hat{\lambda}*y_{i}+(1-\hat{\lambda})*y_{j}$
    else
        $\hat{x}=\lambda*x_{i}+(1-\lambda)*x_{j}$, $\hat{y}=\lambda*y_{i}+(1-\lambda)*y_{j}$
    end if
end for

4.1 StarMix

This section details the StarMix method; the pseudo-code for StarMix is provided in Algorithm 1.

StarMask.

As shown in Figure 3, compared to $\lambda$ in the vanilla MixUp, the mixing mask $StarMask$, which focuses more on the surroundings and center of the samples, adapts well to the distribution of the vein features. To obtain $StarMask$, we define a Gaussian function $Ga(\cdot)$ to generate the mask $G$. We assume a pair of samples $(x_{i},y_{i})$, $(x_{j},y_{j})$, with sample $x\in\mathbb{R}^{w,h,c}$, and we consider a mixing parameter $\lambda$ drawn from $Beta(\alpha,\alpha)$. The Gaussian function is given by Eq.4:

	$\displaystyle Ga(k,w,h,\sigma)=\mathrm{e}^{-\frac{\left(x_{G}-\frac{k}{2}\right)^{2}}{2\sigma^{2}}},$		(4)

where $k$ is the Gaussian kernel's size, set to 224, $x_{G}$ is the region division of the mask, $x_{G}=[k/2:k-k/2,\quad k/2:k/2-k]$, and $\sigma=\lambda*h$. Passing the parameters to $Ga(\cdot)$ as in Eq.5 allows us to obtain the mask $M$:

	$\displaystyle M=\frac{1}{N}\sum_{n=1}^{N}Ga(k_{n},h,\sigma_{n}),$		(5)

where $N=3$ . Note that $\sigma_{2}=(1-\lambda)*h$ , $\sigma_{1}=\sigma_{3}$ , and $k_{3}=2k_{1}=448$ . We then normalize $M$ to get the final mask $G$ as Eq.6:

	$\displaystyle G=\frac{\lambda}{1+e^{-M}}.$		(6)

After we get the mask $G$, we need to recalculate the effective mixing ratio $\hat{\lambda}$, which replaces $\lambda$ when mixing the labels, according to Eq.7:

	$\displaystyle\hat{\lambda}=\frac{\sum_{i=1}^{w}\sum_{j=1}^{h}G_{i,j}}{w\times h},$		(7)

where $G_{i,j}$ denotes the pixel value at the $i$th row and $j$th column in $G$. $\hat{\lambda}$ can thus be seen as the average of all pixel values in $G$ in terms of intensity. Finally, the mixed sample $\hat{x}$ and the corresponding label $\hat{y}$ are obtained according to Eq.8:

	$\displaystyle\hat{x}=Gx_{i}+(1-G)x_{j},$		(8)
	$\displaystyle\hat{y}=\hat{\lambda}y_{i}+(1-\hat{\lambda})y_{j}.$		(8)
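The relationship between Eq.7 and Eq.8 can be sketched as follows, taking the mask $G$ as given. The function name `star_mix` and the half-and-half toy mask are assumptions for illustration; the toy mask is not the StarMask of Eqs.4 to 6, whose generation is left out of this sketch.

```python
import numpy as np

def star_mix(x_i, y_i, x_j, y_j, mask):
    """Mix two samples with a per-pixel mask G (Eq. 8).
    mask: (w, h) array with values in [0, 1]; how it is generated is out of scope here."""
    lam_hat = float(mask.mean())          # Eq. 7: average of all mask pixels
    g = mask[..., None]                   # broadcast the mask over the channel axis
    x_hat = g * x_i + (1.0 - g) * x_j
    y_hat = lam_hat * y_i + (1.0 - lam_hat) * y_j
    return x_hat, y_hat, lam_hat

# Toy mask (hypothetical, not the StarMask): first two columns from x_i, the rest from x_j.
mask = np.zeros((4, 4))
mask[:, :2] = 1.0
x_i, x_j = np.full((4, 4, 1), 2.0), np.zeros((4, 4, 1))
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_hat, y_hat, lam_hat = star_mix(x_i, y_i, x_j, y_j, mask)
```

Recomputing $\hat{\lambda}$ from the mask keeps the label weights equal to the fraction of pixel intensity that each sample actually contributes to the mixed image.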

Threshold setting.

As shown in Figure 2, the obtained masks are not reliable when $\lambda$ is less than 0.3 or more than 0.7 due to $Ga(\cdot)$ , so we propose the use of thresholding to avoid obtaining unreliable masks when mixing:

	$\displaystyle\hat{x}=\left\{\begin{matrix}G*x_{i}+(1-G)*x_{j},&\text{if}\quad 0.3\leq\lambda\leq 0.7,\\ \lambda*x_{i}+(1-\lambda)*x_{j},&\text{otherwise}.\end{matrix}\right.$		(9)

When $\lambda$ is within the [0.3, 0.7] threshold range, StarMix is used, and when $\lambda$ is outside it, we choose vanilla Mixup. This threshold setting avoids the disadvantages of StarMix in special cases. The benefits of mixing StarMix and vanilla Mixup can be seen in Table 3.
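The selection rule of Eq.9 amounts to a simple dispatch on $\lambda$. The sketch below assumes a hypothetical callable `make_mask` that returns a mask for a given $\lambda$; the `constant_mask` stand-in is used only so that the example runs and does not reproduce Eqs.4 to 6.

```python
import numpy as np

def mix_with_threshold(x_i, x_j, lam, make_mask, low=0.3, high=0.7):
    """Eq. 9: mask-based mixing inside [low, high], vanilla Mixup otherwise.
    make_mask is a hypothetical callable returning a (w, h) mask for a given lam."""
    if low <= lam <= high:
        g = make_mask(lam, x_i.shape[:2])[..., None]
        return g * x_i + (1.0 - g) * x_j, "starmix"
    return lam * x_i + (1.0 - lam) * x_j, "mixup"

def constant_mask(lam, shape):
    """Stand-in mask generator for illustration only; the paper uses Eqs. 4-6."""
    return np.full(shape, lam)

a, b = np.ones((4, 4, 1)), np.zeros((4, 4, 1))
mixed_mid, used_mid = mix_with_threshold(a, b, 0.5, constant_mask)    # inside the range
mixed_out, used_out = mix_with_threshold(a, b, 0.9, constant_mask)    # outside the range
```

Only the sample mixing is shown; in the StarMix branch the labels would be mixed with $\hat{\lambda}$ from Eq.7 rather than with $\lambda$.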

4.2 LaKNet

This subsection presents a detailed description of the LaKNet modules and methods.

Architecture Specification.

LaKNet is composed of the following submodules:

Stem is the first module of the network; it captures more details by mapping the original samples into a higher dimensional space. Stem comprises a $Conv_{3\times 3}$ with stride=2 and channel=64. The result is then downsampled by a $Conv_{1\times 1}$ with stride=1 and a $DW$ $Conv_{3\times 3}$ (Depthwise Separable Convolution) with stride=2.

The Embedding sub-module of each stage comprises a $Conv_{1\times 1}$ and a BatchNorm layer. The inputs and outputs of this module are activated by the $ReLU$ function. The main purpose of Embedding is to further aggregate the feature information of the stage's inputs, to prevent the model from overfitting the training set, and to avoid exploding or vanishing gradients. This is a common practice for most network models.

LaKBlock comprises two branches: a $Conv$ module and a gating module. The $Conv$ module combines a $DW$ $Conv$ with a large kernel and a $Conv$ $set$ to extend the Effective Receptive Field by capturing both global and local features. The $Conv$ $set$ uses a small convolution kernel, $Conv_{5\times 5}$, to capture local features, and two dilated convolutions to capture the discrete features without additional overheads. The gating module employs a gating mechanism formed by the combination of $Conv_{1\times 1}$ and the $ReLU$ activation function to obtain effective features from the input $\mathcal{Z}$. We finally merge the output of the $Conv$ module and the output of the gating module as the total output, which is used as input to the Neck layer.

The Neck module connects different stages in the architecture and uses a $Conv_{1\times 1}$ to increase the channel dimension of the input to the next stage.

The FC Layer module is a $Linear$ layer used for the final classification, in which the obtained feature information is mapped to the final classification probability distribution.

To summarize, the channel dimensions differ from stage to stage, and the large kernel sizes also differ because of the downsampling operations. Consequently, the number of layers, channel dimensions, and large kernel sizes of the stages in LaKNet are $\left\{2,2,18,2\right\}$, $\left\{128,256,512,1024\right\}$, and $\left\{31,29,27,13\right\}$, respectively.

Kernels mixing.

This paragraph explains how the large kernel function $Lak(\cdot)$ and the set of small kernel functions $Sak(\cdot)$ are mixed, with the aim of enabling the model to capture both local and global features. This allows for stable optimization of the model $f_{\theta}$:

	$\displaystyle\mathcal{K}=Lak(\mathcal{Z})+Sak(\mathcal{Z}),$		(10)

where $\mathcal{Z}$ denotes the embedding layer output, $\mathcal{K}$ is the sum of the $Lak(\cdot)$ and $Sak(\cdot)$ outputs, and $Sak(\cdot)$ combines a depthwise $Conv_{5\times 5}$ and two depthwise $Dilation$ $Conv_{3\times 3}$ with dilation set to 2 and 4.
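To make the role of the dilated branches concrete, the short calculation below uses the standard relation for the spatial extent of a dilated kernel, $k_{\mathrm{eff}}=d\,(k-1)+1$, applied to the three $Sak(\cdot)$ branches described above. The helper name and the list layout are assumptions for illustration.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1

# The three depthwise branches of Sak as described in the text: (kernel, dilation).
sak_branches = [(5, 1), (3, 2), (3, 4)]
extents = [effective_kernel_size(k, d) for k, d in sak_branches]
weights_per_channel = [k * k for k, _ in sak_branches]

print(extents)              # [5, 5, 9]
print(weights_per_channel)  # [25, 9, 9]
```

The dilated $3\times 3$ branches therefore span $5\times 5$ and $9\times 9$ windows while keeping only 9 weights per channel, which is consistent with the claim that they capture discrete features without additional overhead.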

Table 1: Top-1 accuracy (%) $\uparrow$ and EER (%) $\downarrow$ of the Baseline, MixUp, and StarMix methods on TJU600 based on different CNNs.

TJU600	Baseline		MixUp		StarMix
TJU600	Acc $\uparrow$	EER $\downarrow$	Acc $\uparrow$	EER $\downarrow$	Acc $\uparrow$	EER $\downarrow$
VGG16[33]	73.65	4.07	86.08	2.6	85.57	2.52
ResNet18[9]	89.73	1.2	93.90	0.81	94.67	0.71
ResNet50	85.68	1.96	92.87	0.95	94.13	0.93
FVRASNet[40]	71.02	4.19	87.38	2.11	88.50	2.01
FVCNN[5]	58.65	9.02	79.19	3.73	80.45	3.58
PVCNN[27]	62.57	6.06	86.38	2.46	86.92	2.36
Ours	91.90	1.06	96.43	0.51	96.63	0.44
Gain	+2.17	-0.14	+2.53	-0.30	+1.96	-0.27

Table 2: Top-1 accuracy (%) $\uparrow$ and EER (%) $\downarrow$ of the Baseline, MixUp, and StarMix methods on VERA220 based on different CNNs.

VERA220	Baseline		MixUp		StarMix
VERA220	Acc $\uparrow$	EER $\downarrow$	Acc $\uparrow$	EER $\downarrow$	Acc $\uparrow$	EER $\downarrow$
VGG16[33]	50.82	14.35	90.45	1.53	89.27	1.26
ResNet18[9]	65.82	7.86	94.82	1.12	95.73	0.61
ResNet50	56.18	9.74	87.82	2.25	91.73	1.56
FVRASNet[40]	61.73	7.74	83.36	2.36	89.09	2.28
FVCNN[5]	61.91	6.64	78.64	3.41	79.09	3.50
PVCNN[27]	50.82	11.42	89.55	1.8	89.73	1.19
Ours	85.55	1.81	97.52	0.51	98.09	0.35
Gain	+19.73	-5.93	+2.7	-0.61	+2.36	-0.26

Gating.

The Gating operation aims at emphasizing or suppressing features by learning the weights of $Conv_{1\times 1}$, combined with the activation function to introduce nonlinearity and complete the feature screening process. The rationale for selecting the faster ReLU over SiLU is that the input $\mathcal{Z}$ has already been activated by ReLU before it is fed into LaKBlock, so no additional nonlinearity on the negative half-axis is needed. The gating operation is given by Eq.11:

	$\displaystyle gating(\mathcal{Z})=ReLU(Conv_{1\times 1}(\mathcal{Z})).$		(11)

5 Experiments

To evaluate the efficacy of our method, we conducted experimental comparisons that included a total of 6 benchmarks in 3 cases: baseline, MixUp, and StarMix. For fair validation, we compared against several mainstream networks, i.e. VGG16[33], ResNet18[9], and ResNet50; we also compared against three classifiers designed for the vein task: FVCNN[5], PVCNN[27], and FVRasNet[40]. We report the median test-set Top-1 accuracy over the last 10 epochs as the final result, since the classification performance of the model tends to stabilize in that state, which makes it a better measure of the model's final performance and degree of overfitting. The False Acceptance Rate (FAR) and False Rejection Rate (FRR) of the last epoch are used to compute the Equal Error Rate (EER), which is the error rate at the specific threshold where FAR and FRR become equal. Lower EER values indicate better verification performance. We marked the best and second best results using bold and cyan.
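For readers unfamiliar with the metric, the following sketch shows one common way to approximate the EER from matching scores by sweeping a decision threshold. It assumes similarity scores where higher means more similar; the function name and the score values are made up for illustration, and the paper does not specify how its scores or thresholds are obtained.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate EER from similarity scores (higher means more similar)."""
    genuine, impostor = np.asarray(genuine, float), np.asarray(impostor, float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # impostors accepted
    frr = np.array([(genuine < t).mean() for t in thresholds])    # genuines rejected
    best = np.argmin(np.abs(far - frr))                            # closest crossing point
    return (far[best] + frr[best]) / 2.0, thresholds[best]

genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.4]    # made-up scores for illustration
impostor_scores = [0.7, 0.5, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)
```

With these toy scores, FAR and FRR are both 0.2 at a threshold of 0.6, so the sketch reports an EER of 20%.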

5.1 Dataset information

We choose two large public palm vein datasets for our experiments:

TJU600: The TJU palm vein dataset[46] consists of palm vein images provided for the left and right hands of 300 volunteers. The sample data from each volunteer was collected 2 times, with a time interval between the 2 collections of about 60 days. As 10 vein images were collected from each palm each time, the dataset contains a total of 12,000 images (300 volunteers $\times$ 2 palms $\times$ 10 images $\times$ 2 time periods). The resolution of each image was scaled to 224 $\times$ 224 after ROI normalization.

VERA220: The VERA palm vein dataset[34] contains palm vein images of the left and right hands of 110 volunteers, acquired over two time periods. As 5 palm vein images were collected for each hand at each time, the dataset contains a total of 2200 images (110 volunteers $\times$ 2 palms $\times$ 5 images $\times$ 2 time periods). The resolution of each image was also set to 224 $\times$ 224.

5.2 Experiments implementation details

For TJU600, we divided the total dataset into a training set and a test set, each consisting of 6000 images (300 volunteers $\times$ 2 palms $\times$ 10 images). Similarly, we divided the VERA220 dataset into a training set and a test set, each consisting of 1100 images (110 volunteers $\times$ 2 palms $\times$ 5 images). We use RandomFlip and RandomCrop with 3-pixel padding as the base augmentation methods. The batch size was set to 32, and training ran for a total of 600 epochs; the learning rate was set to 0.1. For VGG16, the learning rate was set to 0.01, adjusted by the cosine scheduler. FVCNN and PVCNN used the Adam[22] optimizer with a learning rate of 0.0001 and a weight decay of 0.001; the other models were trained with the SGD[21] optimizer with a momentum of 0.9 and a weight decay of 0.0001. For MixUp and StarMix, $\lambda$ is drawn from $Beta(\alpha,\alpha)$ with $\alpha$ = 1.

5.3 Classification results

Table 1 and Table 2 show the performance of StarMix and LaKNet in terms of 2 metrics: classification accuracy and EER. Table 1 shows that our method outperforms the existing methods on TJU600 in classification performance. StarMix achieves good gains in different models; in particular, with ResNet50 and FVCNN, we achieve 94.13% and 80.45% and outperform vanilla Mixup by +1.26% and +1.26%. Compared to other models, StarLKNet achieves 91.90% accuracy without any augmentation method and 96.63% with StarMix. The results of our method on VERA220 are shown in Table 2. We observe that, due to the small size of the VERA220 dataset, the performance of the models without augmentation is poor. After using Mixup and StarMix, the performance improves significantly. VGG16 and PVCNN, for instance, get an improvement of +39.63% and +39.18%, respectively. In the same setting without augmentation methods, however, StarLKNet still obtains a good result, which demonstrates its ability to extract features effectively.

5.4 Robustness

5.4.1 Equal Error Rate

In the biometric identification task, positive and negative samples are usually highly imbalanced, and there is a clear trade-off relationship between FAR and FRR. Consequently, identification technology necessitates rigorous criteria for model performance evaluation. The conventional single index of accuracy is insufficient to fully reflect model performance and stability. The ROC (Receiver Operating Characteristic) curve and the EER offer a comprehensive evaluation of a biometric model’s performance in practical applications.

In the four subfigures of Figure 5, we observe that our method achieves the best ROC curves both with and without StarMix, and it can also be seen that the gain for the model with the use of StarMix is significant.

5.4.2 Occlusion

To analyze the robustness of StarLKNet to random occlusions[25], we built test datasets from TJU600 and VERA220, occluded by random patches of size 16 $\times$ 16, with the occlusion ratio gradually increasing from 0% to 10%. We fed the occluded test sets into the different models for classification. Since the features of vein images are continuous and discrete, a mask patch may interrupt the features when it falls in a region with continuous veins. This is challenging for the recognition model and offers an additional configuration to assess the model's robustness to vein features.
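One plausible way to build such an occluded test image is sketched below. It assumes that the image is divided into a grid of non-overlapping 16 $\times$ 16 patches, that the occlusion ratio is the fraction of patches set to zero, and that occluded pixels are filled with zeros; none of these details is confirmed by the text, and the function name `occlude` is hypothetical.

```python
import numpy as np

def occlude(image, ratio, patch=16, rng=None):
    """Zero out a random fraction of non-overlapping patch x patch blocks.
    Assumes image height and width are multiples of patch (e.g. 224 and 16)."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    n_drop = int(round(ratio * rows * cols))                 # number of patches to occlude
    dropped = rng.choice(rows * cols, size=n_drop, replace=False)
    out = image.copy()
    for idx in dropped:
        r, c = divmod(int(idx), cols)                        # grid position of the patch
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out

image = np.ones((224, 224, 1))                               # toy stand-in for a vein ROI
occluded = occlude(image, ratio=0.10, rng=np.random.default_rng(0))
```

For a 224 $\times$ 224 image the grid has 196 patches, so a 10% ratio occludes 20 of them under this rounding.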

From Figure 6, we observe that StarLKNet achieves the highest accuracy for occlusion ratios from 0% $\to$ 8%, which shows that a Large Kernel is beneficial for robustness. Compared to the baseline, StarMix is also shown to be beneficial for robustness.

5.5 Ablation Study

Our ablation experiments aim at analyzing the effectiveness of the StarMix augmentation method, the Gating module, and the Large Kernel module in LaKNet.

• We perform experiments using ResNet18 and FVRasNet on the VERA220 dataset. The aim is to verify the effect of StarMix as well as the [0.3,0.7] threshold setting. Table 3 demonstrates that applying StarMix is detrimental to the model in certain instances, resulting in a 0.1% reduction in Top-1 accuracy in comparison to the MixUp test set. As illustrated in Figure 3, the mask is observed to concentrate on the central region when $\lambda$ is less than 0.3 and on the periphery when it is greater than 0.7. When $\lambda$ is between 0.3 and 0.7, however, the mask achieves an additional boost by outperforming MixUp by +0.91% and +5.73% on ResNet18 and FVRasNet, respectively.

• In Table 4, we have performed ablation experiments on each module of LaKNet to verify the effect of the corresponding module. It can be seen that the $Conv$ $set$ provides +0.36% improvement compared to LaKNet with only $Conv_{K\times K}$ and a $Conv_{5\times 5}$. The gating operation also helps the model's classification performance by +0.74%. This is reasonable because the $Conv$ $set$ expands the Effective Receptive Field using dilated convolutions, which allows the model to capture more global features. The gating module filters the features, which further retains the valid high-dimensional features.

Table 3: Ablation experiments on StarMix and the threshold setting on VERA220 with ResNet18 and FVRasNet.

VERA220	ResNet18	FVRasNet
Baseline	65.82	61.73
Mixup	94.82	83.36
StarMix	94.72	87.37
w threshold	95.73	89.09
Gain	+1.01	+1.72

Table 4: Ablation experiments on the LaKNet baseline and the effectiveness of its modules on the VERA220 and TJU600 datasets.

Modules	TJU600	VERA220
ResNet18	89.73	65.82
LaKNet	89.78	84.45
+ $Conv$ $set$	90.12	84.81
+ Gating	91.90	85.55
Gain	+1.78	+0.74

6 Conclusion

In this paper, we have proposed StarLKNet, a Conv-based palm vein identification network with large kernels, that incorporates the StarMix data augmentation method and the LaKNet structure. StarMix generates masks for image mixing through Gaussian functions, which effectively reduces the overfitting problem caused by insufficient samples in the dataset and also improves the robustness of the model. The LaKNet network captures more globally effective features through a large Effective Receptive Field combined with the screening capability of the gating mechanism, thus stabilizing the recognition capability of the model. A series of classification and analysis experiments have been conducted to validate the performance of our method.

Future work.

Our future work will further explore how to increase the size of the Effective Receptive Field and the extent to which the latter affects model performance. We will also investigate methods to improve the time overhead problem caused by large convolutional kernels. Concerning the mixup method, we will attempt to develop an end-to-end approach.

References

Chou et al. [2020] H.-P. Chou, S.-C. Chang, J.-Y. Pan, W. Wei, and D.-C. Juan. Remix: rebalanced mixup. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 95–110. Springer, 2020.
Chugh et al. [2018] T. Chugh, K. Cao, and A. K. Jain. Fingerprint spoof buster: Use of minutiae-centered patches. IEEE Transactions on Information Forensics and Security, 13(9):2190–2202, 2018.
Cola et al. [2016] G. Cola, M. Avvenuti, F. Musso, and A. Vecchio. Gait-based authentication using a wrist-worn device. In Proceedings of the 13th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pages 208–217, 2016.
da Wu and Liu [2011] J. da Wu and C.-T. Liu. Finger-vein pattern identification using svm and neural network technique. Expert Syst. Appl., 38:14284–14289, 2011.
Das et al. [2018] R. Das, E. Piciucco, E. Maiorana, and P. Campisi. Convolutional neural network for finger-vein-based biometric identification. IEEE Transactions on Information Forensics and Security, 14(2):360–373, 2018.
Ding et al. [2022] X. Ding, X. Zhang, J. Han, and G. Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11963–11975, 2022.
Ding et al. [2023] X. Ding, Y. Zhang, Y. Ge, S. Zhao, L. Song, X. Yue, and Y. Shan. Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition. ArXiv, abs/2311.15599, 2023.
Dosovitskiy et al. [2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
Huang et al. [2010] B. Huang, Y. Dai, R. Li, D. Tang, and W. Li. Finger-vein authentication based on wide line detector and pattern normalization. In 2010 20th International Conference on Pattern Recognition, pages 1269–1272, 2010.
Kang et al. [2019] W. Kang, Y. Lu, D. Li, and W. Jia. From noise to feature: Exploiting intensity distribution as a novel soft biometric trait for finger vein recognition. IEEE Transactions on Information Forensics and Security, 14:858–869, 2019.
Kim et al. [2020a] J. Kim, W. Choo, H. Jeong, and H. O. Song. Co-mixup: Saliency guided joint mixup with supermodular diversity. In International Conference on Learning Representations, 2020a.
Kim et al. [2020b] J.-H. Kim, W. Choo, and H. O. Song. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning, pages 5275–5285. PMLR, 2020b.
Li et al. [2021] S. Li, Z. Liu, Z. Wang, D. Wu, Z. Liu, and S. Z. Li. Boosting discriminative visual representation learning with scenario-agnostic mixup. arXiv preprint arXiv:2111.15454, 2021.
Li et al. [2024] S. Li, Z. Wang, Z. Liu, C. Tan, H. Lin, D. Wu, Z. Chen, J. Zheng, and S. Z. Li. Moganet: Multi-order gated aggregation network. In International Conference on Learning Representations, 2024.
Liu et al. [2022a] S. Liu, T. Chen, X. Chen, X. Chen, Q. Xiao, B. Wu, M. Pechenizkiy, D. C. Mocanu, and Z. Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620, 2022a.
Liu et al. [2017] W. Liu, W. Li, L. Sun, L. Zhang, and P. Chen. Finger vein recognition based on deep learning. In 2017 12th IEEE Conference on Industrial Electronics and Applications (ICIEA), pages 205–210, 2017.
Liu et al. [2018] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang. Exploring disentangled feature representation beyond face identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2080–2089, 2018.
Liu et al. [2022b] Z. Liu, S. Li, D. Wu, Z. Liu, Z. Chen, L. Wu, and S. Z. Li. Automix: Unveiling the power of mixup for stronger classifiers. In European Conference on Computer Vision, pages 441–458. Springer, 2022b.
Liu et al. [2023] Z. Liu, S. Li, G. Wang, L. Wu, C. Tan, and S. Z. Li. Harnessing hard mixed samples with decoupled regularizer. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Loshchilov and Hutter [2016] I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
Loshchilov and Hutter [2019] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
Mathur and Matarić [2020] L. Mathur and M. J. Matarić. Introducing representations of facial affect in automated multimodal deception detection. In Proceedings of the 2020 International Conference on Multimodal Interaction, 2020.
Miura et al. [2007] N. Miura, A. Nagasaka, and T. Miyatake. Extraction of finger-vein patterns using maximum curvature points in image profiles. IEICE Trans. Inf. Syst., 90-D:1185–1194, 2007.
Naseer et al. [2021] M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. S. Khan, and M.-H. Yang. Intriguing properties of vision transformers. In Neural Information Processing Systems, 2021.
Olsson et al. [2021] V. Olsson, W. Tranheden, J. Pinto, and L. Svensson. Classmix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1369–1378, 2021.
Qin et al. [2021] H. Qin, M. A. El-Yacoubi, Y. Li, and C. Liu. Multi-scale and multi-direction gan for cnn-based single palm-vein identification. IEEE Transactions on Information Forensics and Security, 16:2652–2666, 2021.
Qin et al. [2023a] H. Qin, R. Hu, M. A. El-Yacoubi, Y. Li, and X. Gao. Local attention transformer-based full-view finger-vein identification. IEEE Transactions on Circuits and Systems for Video Technology, 33:2767–2782, 2023a.
Qin et al. [2023b] H. Qin, X. Jin, Y. Jiang, M. A. El-Yacoubi, and X. Gao. Adversarial automixup. arXiv preprint arXiv:2312.11954, 2023b.
Qin et al. [2024] H. Qin, H. Zhu, X. Jin, Q. Song, M. A. El-Yacoubi, and X. Gao. Emmixformer: Mix transformer for eye movement recognition. arXiv preprint arXiv:2401.04956, 2024.
Qin et al. [2020] J. Qin, J. Fang, Q. Zhang, W. Liu, X. Wang, and X. Wang. Resizemix: Mixing data with preserved object information and true labels. arXiv preprint arXiv:2012.11101, 2020.
Radzi et al. [2016] S. A. Radzi, M. Khalil-Hani, and R. Bakhteri. Finger-vein biometric identification using convolutional neural network. Turkish Journal of Electrical Engineering and Computer Sciences, 24:1863–1878, 2016.
Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Tome and Marcel [2015] P. Tome and S. Marcel. On the vulnerability of palm vein recognition to spoofing attacks. In 2015 International Conference on Biometrics (ICB), pages 319–325. IEEE, 2015.
Uddin et al. [2020] A. S. Uddin, M. S. Monira, W. Shin, T. Chung, and S.-H. Bae. Saliencymix: A saliency guided data augmentation strategy for better regularization. In International Conference on Learning Representations, 2020.
Verma et al. [2019] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447, 2019.
Wang et al. [2020] G. Wang, C. Sun, and A. Sowmya. Multi-weighted co-occurrence descriptor encoding for vein recognition. IEEE Transactions on Information Forensics and Security, 15:375–390, 2020.
Wu et al. [2023] Y. Wu, H. Liao, H. Zhu, X. Jin, S. Yang, and H. Qin. Adversarial contrastive learning based on image generation for palm vein recognition. In 2023 2nd International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP), pages 18–24. IEEE, 2023.
Yang et al. [2022] S. Yang, Y. Wu, X. Jin, M. El Yacoubi, and H. Qin. Cgan-da: A cross-modality domain adaptation model for hand-vein biometric-based authentication. Journal of Cyber-Physical-Social Intelligence, 1:3–12, 2022.
Yang et al. [2020] W. Yang, W. Luo, W. Kang, Z. Huang, and Q. Wu. Fvras-net: An embedded finger-vein recognition and antispoofing system using a unified cnn. IEEE Transactions on Instrumentation and Measurement, 69(11):8690–8701, 2020.
Yao et al. [2022] H. Yao, Y. Wang, L. Zhang, J. Y. Zou, and C. Finn. C-mixup: Improving generalization in regression. Advances in neural information processing systems, 35:3361–3376, 2022.
Yoo et al. [2020] J. Yoo, N. Ahn, and K.-A. Sohn. Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8375–8384, 2020.
Yu et al. [2023] W. Yu, P. Zhou, S. Yan, and X. Wang. Inceptionnext: When inception meets convnext. arXiv preprint arXiv:2303.16900, 2023.
Yun et al. [2019] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), pages 6023–6032, 2019.
Zhang et al. [2018a] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018a.
Zhang et al. [2018b] L. Zhang, Z. Cheng, Y. Shen, and D. Wang. Palmprint and palmvein recognition based on dcnn and a new large-scale contactless palmvein dataset. Symmetry, 10(4):78, 2018b.
Zhou and Kumar [2011] Y. Zhou and A. Kumar. Human identification using palm-vein images. IEEE Transactions on Information Forensics and Security, 6(4):1259–1274, 2011.
