StarLKNet: Star Mixup with Large Kernel Networks
for Palm Vein Identification

Xin **¹ (ORCID: 0009-0005-0983-6853), Hongyu Zhu² (ORCID: 0009-0000-5993-4666), Mounîm A. El Yacoubi, Hongchao Liao, Hufeng Qin, Yun Jiang
¹ Equal contribution.
Corresponding Author. Email: [email protected]
Chongqing Technology and Business University, China; Telecom SudParis, Institut Polytechnique de Paris, France
Abstract

As a representative of a new generation of biometrics, vein identification technology offers a high level of security and convenience. Convolutional neural networks (CNNs), a prominent class of deep learning architectures, have been extensively used for vein identification. However, their performance and robustness are limited by small Effective Receptive Fields (e.g., 3×3 kernels) and insufficient training samples, so they cannot effectively extract global feature representations from vein images. To address these issues, we propose StarLKNet, a large kernel convolution-based palm-vein identification network combined with a Mixup approach, StarMix. StarMix effectively learns the distribution of vein features to expand the training samples. To enable CNNs to capture comprehensive feature representations from palm-vein images, we explore the effect of convolutional kernel size on the performance of palm-vein identification networks and design LaKNet, a network leveraging large kernel convolution and a gating mechanism. To the best of our knowledge, this is the first deployment of a CNN with large kernels in the domain of vein identification. Extensive experiments were conducted to validate the performance of StarLKNet on two public palm-vein datasets. The results demonstrate that StarMix provides superior augmentation and that LaKNet exhibits more stable performance gains than mainstream approaches, resulting in the highest recognition accuracy and lowest identification error.

1 Introduction

Personal information security has received increasing attention in modern society, as misidentification can have a catastrophic impact on personal property and privacy. Token-based authentication methods such as passwords and ID cards are at risk of being forgotten or stolen. In recent decades, a great deal of research has been conducted on biometric technology, which identifies individuals through their physiological (e.g., face[18], fingerprint[2], and vein[38, 39]) or behavioral (e.g., gait[3] and eye movement[30]) characteristics. The most common biometric features used in applications are faces and fingerprints. However, these external features are vulnerable to forgery attacks[23]. In contrast, vein recognition offers significant advantages. Veins and blood vessels are located inside the human body and are not easily affected by external conditions (e.g., skin moisture and wear). Furthermore, because deoxygenated hemoglobin only exists in the living body, vein recognition technology provides inherent liveness detection[28].

Figure 1: Left: Top-1 accuracy (↑) using MixUp and StarMix in different models; Right: StarMix and ResNet18 fitting curves on the VERA220 dataset; StarMix fits faster and with higher classification accuracy.

In the field of vein recognition, traditional methods rely on hand-crafted feature extraction and shallow machine learning algorithms for classification[11, 37]. These methods assume that vein patterns follow a valley or line-like shape, and therefore ignore valid information in higher dimensions. Methods based on deep learning can achieve end-to-end feature extraction without prior assumptions, but training the network parameters requires a large amount of data. Unfortunately, it is challenging to obtain a substantial number of samples from each class in practical applications due to limited storage and privacy policies. How to train a robust, high-performance network with limited data is therefore a pressing problem.

Figure 2: StarLKNet training workflow. After the mixing ratio is drawn, a mixing method is chosen according to the threshold setting and used to produce the mixed samples; these pass through the encoder to obtain the final prediction, and the loss is then computed to complete one training step.

To address the overfitting issue arising from insufficient training data, researchers have proposed data augmentation (DA) techniques, which generate new data through either hand-crafted or generative means to expand the training set. Mixup uses global linear interpolation to mix two or more samples. Thanks to its plug-and-play functionality and minimal additional time overhead, it has been widely used and improved by researchers (e.g., CutMix[44], AutoMix[19], PuzzleMix[13], and AdAutoMix[29]). Since its inception, Mixup has demonstrated strong performance and robust generalizability in downstream tasks, including Super-resolution[42], Segmentation[26], Regression[41], and Long-tail[1].

Starting with VGG[33], researchers have gradually shifted their focus away from large kernels in CNNs, opting instead for stacks of multiple smaller kernels, as overly large kernels result in significant time and memory overheads and tend to overlook localized, subtle features within images. Nevertheless, some recent studies[6, 16, 43, 7] have indicated that, with some subtle improvements, CNNs with a larger effective receptive field can rival the performance of Vision Transformer (ViT)[8]. Because vein image features are continuously and sparsely distributed, we posit that large convolutional kernel networks have a significant advantage over small ones in capturing the distribution of vein features. As shown in Figure 1, the large kernel exhibits faster fitting speeds and higher classification accuracies than the small kernel framework represented by ResNet18[9].

To address the challenges of feature extraction from vein images, we propose a Mixup method for vein images to augment the training data. Furthermore, we design a high-performance palm vein recognition network framework based on the stacked formation of large kernel convolutional modules, in order to extract a more comprehensive and robust feature representation of palm vein images. First, we propose StarMix, featuring a more suitable mask for vein image mixing, with the mixing parameter generated from a Gaussian function. In our scheme, the network is free to choose the mixing strategy as either StarMix or vanilla Mixup with a threshold. Furthermore, we propose LaKNet, a network with convolutional and gating modules. The convolutional module comprises large kernel convolution and small kernel convolution, employed for extracting global and local features. The gating module facilitates feature filtering by learning to control the flow of feature information. Experimental results indicate that the StarLKNet model exhibits superior performance compared to the ResNet18 model on the VERA220[34] dataset, with an improvement in the test set top-1 accuracy of +19.73% without augmentation.

In summary, our main contributions are as follows:

  • We rethink the impact of convolutional kernel size on network performance in the vein recognition task, and find that for vein images with continuous and sparse feature distributions, increasing the Effective Receptive Field can significantly improve network performance.

  • We propose StarLKNet, a novel network framework designed for vein recognition, that employs a large kernel convolutional module and a gating module to achieve comprehensive and robust feature extraction. Our evaluation on two large public palm vein datasets demonstrates that StarLKNet outperforms existing methods in terms of recognition accuracy and validation error.

  • We propose StarMix, a data augmentation method that uses a Gaussian function to generate mixing masks suited to the feature distribution of vein images, thereby significantly enhancing classifier performance.

2 Related Work

Palm Vein Identification.

Because hemoglobin absorbs infrared light, infrared cameras can acquire vein images. This acquisition method and the characteristics of the vein distribution bring challenges for feature extraction. Research addressing this challenge can be broadly divided into two categories: traditional methods, in which handcrafted features are extracted and fed to shallow machine learning models, and CNN-based feature extraction methods. 1) In the first category, Miura et al.[24], for instance, performed repeated line tracking to detect valley shapes in cross-sectional vein patterns and extract finger-vein texture for verification. Other works, assuming that vein patterns in a predefined neighborhood can be regarded as line segments, have proposed line detection methods to extract line-like textures, including Gabor-based[47] and wide line detectors[10]. Among the traditional shallow machine learning methods used in the vein domain, [4], for instance, employed principal component analysis (PCA) and linear discriminant analysis (LDA) for feature dimensionality reduction, combined with a support vector machine (SVM) for feature classification. 2) CNNs have demonstrated remarkable capability in extracting features in vein recognition tasks. Syafeeza et al., for instance, proposed a CNN with four layers for finger-vein identification[32], and later employed a pre-trained VGG16[33] model and an enhanced Conv with seven layers to identify the finger-veins[17].

MixUp.

Mixing augmentation methods started with MixUp[45], which consists of a static linear interpolation of two samples according to a mixing ratio from 0 to 1 to obtain a mixed sample. CutMix[44] moves the mixing from the pixel level to the spatial level, generating a mask sized according to the mixing ratio to randomly mix patches. Subsequently, several methods were proposed to improve the sample mixing policies or label mixing policies[36, 31, 14, 20]. In contrast to hand-crafted methods, SaliencyMix[35] obtains saliency information through an additional feature extractor, which guides the mixing of the samples. With a similar aim, PuzzleMix[13] and Co-Mix[12] utilize gradient information from backward propagation to locate feature regions and employ an optimal transport scheme to avoid overlapping feature information by maximizing the features in the mixed samples. AutoMix[19] adopts an end-to-end approach by designing a generator mix block that optimizes both the generator and the model, achieving an optimal result in terms of time overhead and performance. More recently, AdAutoMix[29], built upon AutoMix, has been proposed to augment the generated samples by mixing any set of $N$ samples and not only two; AdAutoMix also proposed adversarial training to prevent generator overfitting, by pushing the generator to generate difficult samples with more impact on improving training performance.

3 Preliminaries

3.1 Mixup

We define $\mathbb{X}$ to be the set of training samples and $\mathbb{Y}$ the set of corresponding ground-truth labels. For each sample pair $(x, y)$, $x \in \mathbb{R}^{w,h,c}$ is the sample and $y \in \mathbb{R}^{C}$ is the corresponding one-hot label, where $w, h, c$ are the sample's width, height, and number of channels, respectively, and $C$ is the number of classes. We mix the sample pairs $(x_i, y_i)$, $(x_j, y_j)$ by linear interpolation according to MixUp to obtain mixed samples and labels:

$$
\begin{aligned}
\hat{x} &= \lambda * x_i + (1-\lambda) * x_j, \\
\hat{y} &= \lambda * y_i + (1-\lambda) * y_j,
\end{aligned} \tag{1}
$$

where $\lambda$ is the mixing ratio drawn from the $Beta(\alpha,\alpha)$ distribution. We map the mixed sample $\hat{x}$ to its label $\hat{y}$ by a deep neural network $f_{\theta}$: $f_{\theta}(\hat{x}) \mapsto \hat{y}$. The network parameter vector $\theta$ of $f$ is trained by minimizing the loss function.
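As an illustration of Eq.1, the following is a minimal NumPy sketch of vanilla Mixup for a single pair of samples. The function name `mixup`, the toy image size, and the choice `alpha=1.0` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=1.0, rng=None):
    """Vanilla Mixup (Eq. 1): one scalar ratio mixes both images and labels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # lambda ~ Beta(alpha, alpha)
    x_hat = lam * x_i + (1.0 - lam) * x_j   # mixed sample
    y_hat = lam * y_i + (1.0 - lam) * y_j   # mixed (soft) label
    return x_hat, y_hat, lam

# Toy data: two 8x8 single-channel "images" and one-hot labels over 4 classes.
rng = np.random.default_rng(0)
x_a, x_b = rng.random((8, 8, 1)), rng.random((8, 8, 1))
y_a, y_b = np.eye(4)[0], np.eye(4)[2]
x_hat, y_hat, lam = mixup(x_a, y_a, x_b, y_b, alpha=1.0, rng=rng)
```

The mixed label remains a valid probability vector because its two non-zero entries are $\lambda$ and $1-\lambda$.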

3.2 Gating

The MogaNet gating mechanism in our network comprises a $Conv_{1\times 1}$ and the $SiLU$ activation function[15]. It provides a simple yet effective structure for filtering the flow of information during network training, thereby enhancing the efficacy of the extracted features. The objective of using $SiLU$ is to combine the smoothness and nonlinearity of the $Sigmoid$ function with the linear behavior of ReLU in the positive region:

$$
SiLU(z) = z \cdot Sigmoid(z). \tag{2}
$$

Specifically, $Conv_{1\times 1}$ implements a linear transformation of the features, whose weights are adjusted to emphasize or suppress features. The output of $Conv_{1\times 1}$ applied to the feature map $z \in \mathbb{R}^{w,h,c}$ is used as the input to $SiLU$, which adjusts its activation value according to the importance of the input features, thereby realizing the gating mechanism:

$$
gating(z) = SiLU(Conv_{1\times 1}(z)). \tag{3}
$$
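Eqs. 2 and 3 can be sketched in a few lines of NumPy. The sketch relies on the fact that a 1×1 convolution is a per-pixel linear map over the channel axis; the helper names (`conv1x1`, `gating`), the random weights, and the tensor sizes are hypothetical and serve only to illustrate the formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def silu(z):
    """SiLU(z) = z * Sigmoid(z), as in Eq. 2."""
    return z * sigmoid(z)

def conv1x1(z, weight, bias):
    """1x1 convolution as a per-pixel linear map over channels.

    z: (w, h, c_in), weight: (c_in, c_out), bias: (c_out,)
    """
    return z @ weight + bias

def gating(z, weight, bias):
    """gating(z) = SiLU(Conv1x1(z)), as in Eq. 3."""
    return silu(conv1x1(z, weight, bias))

# Toy feature map with 3 channels; weights are random placeholders.
rng = np.random.default_rng(0)
z = rng.standard_normal((4, 4, 3))
weight = rng.standard_normal((3, 3))
bias = np.zeros(3)
out = gating(z, weight, bias)
```

For large positive inputs SiLU behaves like the identity, and for negative inputs it stays slightly below zero instead of being clipped to zero as with ReLU.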

4 StarLKNet

Our proposed StarLKNet model, illustrated in Figure 2, consists of two components, StarMix and LaKNet. StarMix employs a mask generated by a Gaussian function to mix and augment the data, while LaKNet consists of a convolutional module with a large kernel and a gating module. In the following subsections, we first introduce our hybrid method StarMix, and then detail LaKNet.

Figure 3: Visualizations of StarMask with different $\lambda$, from left to right.
Input: $Beta(\alpha,\alpha)$ distribution; training samples and labels $\mathbb{X}$, $\mathbb{Y}$; mixing ratio $\lambda$; threshold range [0.3, 0.7]; Gaussian function $Ga(\cdot)$; kernel size $k$.
  $x \in \mathbb{R}^{w,h,c}$
  for $(x_i, y_i)$ in loader($\mathbb{X}$, $\mathbb{Y}$) do
    $\lambda \sim Beta(\alpha,\alpha)$
    $(x_j, y_j)$ = $(x_i, y_i)$ with the batch order shuffled by an index permutation from torch.randperm
    if $0.3 \leq \lambda \leq 0.7$ then
      $G = Ga(k, h, \sigma)$
      compute $\hat{\lambda}$ according to Eq.7
      $\hat{x} = G * x_i + (1-G) * x_j$
      $\hat{y} = \hat{\lambda} * y_i + (1-\hat{\lambda}) * y_j$
    else
      $\hat{x} = \lambda * x_i + (1-\lambda) * x_j$
      $\hat{y} = \lambda * y_i + (1-\lambda) * y_j$
    end if
  end for
Algorithm 1: StarMix pseudo-code process.

4.1 StarMix

This section details the StarMix method; the pseudo-code for StarMix is provided in Algorithm 1.

StarMask.

As shown in Figure 3, compared to $\lambda$ in the vanilla MixUp, the mixing mask $StarMask$, focusing more on the surroundings and center of the samples, adapts well to the distribution of the vein features. To obtain $StarMask$, we define a Gaussian function $Ga(\cdot)$ to generate the mask $G$. We assume a pair of samples $(x_i, y_i)$, $(x_j, y_j)$, with sample $x \in \mathbb{R}^{w,h,c}$, and we consider a mixing parameter $\lambda$ drawn from $Beta(\alpha,\alpha)$. The Gaussian function is given by Eq.4:

$$
Ga(k, w, h, \sigma) = \mathrm{e}^{-\frac{\left(x_G - \frac{k}{2}\right)^{2}}{2\sigma^{2}}}, \tag{4}
$$

where $k$ is the Gaussian kernel's size, set to 224, $x_G$ is the region division of the mask, $x_G = [k/2 : k-k/2, \quad k/2 : k/2-k]$, and $\sigma = \lambda * h$. Passing the parameters to $Ga(\cdot)$ as in Eq.5 allows us to obtain the mask $M$:

$$
M = \frac{1}{N} \sum_{n=1}^{N} Ga(k_n, h, \sigma_n), \tag{5}
$$

where $N = 3$. Note that $\sigma_2 = (1-\lambda) * h$, $\sigma_1 = \sigma_3$, and $k_3 = 2k_1 = 448$. We then normalize $M$ to get the final mask $G$ as in Eq.6:

$$
G = \frac{\lambda}{1 + e^{-M}}. \tag{6}
$$
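As a reading aid, the range of mask values implied by Eqs. 4 to 6 can be worked out directly. The short derivation below takes the equations literally and assumes $\sigma_n > 0$; an actual implementation may normalize differently.

$$
0 < Ga(\cdot) \le 1
\;\Rightarrow\;
0 < M \le 1
\;\Rightarrow\;
\frac{1}{2} < \frac{1}{1+e^{-M}} \le \frac{1}{1+e^{-1}} \approx 0.731
\;\Rightarrow\;
\frac{\lambda}{2} < G \le 0.731\,\lambda .
$$

For example, under this reading $\lambda = 0.5$ yields mask values between 0.25 and roughly 0.366.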
Figure 4: The framework of the proposed StarLKNet. Upper: the main modules of StarLKNet, containing 1 Stem, 4 Stages, 3 Necks, and an FC layer. Lower: a comprehensive illustration of the functions and specific operations within each module.

After obtaining the mask $G$, we recalculate the mixing ratio $\lambda$ according to Eq.7:

$$
\hat{\lambda} = \frac{\sum_{i=1}^{w} \sum_{j=1}^{h} G_{i,j}}{w \times h}, \tag{7}
$$

where $G_{i,j}$ denotes the pixel value at the $i$-th row and $j$-th column of $G$, so that $\hat{\lambda}$ can be seen as the average of all pixel values in $G$ in terms of intensity. Finally, the mixed sample $\hat{x}$ and the corresponding label $\hat{y}$ are obtained according to Eq.8:

$$
\begin{aligned}
\hat{x} &= G * x_i + (1-G) * x_j, \\
\hat{y} &= \hat{\lambda} * y_i + (1-\hat{\lambda}) * y_j.
\end{aligned} \tag{8}
$$

Threshold setting.

As shown in Figure 2, the obtained masks are not reliable when $\lambda$ is less than 0.3 or more than 0.7 due to $Ga(\cdot)$, so we propose the use of thresholding to avoid obtaining unreliable masks when mixing:

$$
\hat{x} =
\begin{cases}
G * x_i + (1-G) * x_j, & \text{if } 0.3 \leq \lambda \leq 0.7, \\
\lambda * x_i + (1-\lambda) * x_j, & \text{otherwise}.
\end{cases} \tag{9}
$$

When $\lambda$ is within the [0.3, 0.7] threshold range, StarMix is used, and when $\lambda$ is outside it, we choose vanilla Mixup. This threshold setting avoids the disadvantages of StarMix in special cases. The benefits of mixing StarMix and vanilla Mixup can be seen in Table 3.
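The selection rule of Eq.9, together with Eqs. 6 to 8, can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function name `mix_pair` is hypothetical, and `mask_m` is a placeholder standing in for the averaged Gaussian mask $M$ of Eq.5, whose construction is not reproduced here.

```python
import numpy as np

def mix_pair(x_i, y_i, x_j, y_j, lam, mask_m, low=0.3, high=0.7):
    """Mix one sample pair with StarMix inside [low, high], else vanilla Mixup.

    mask_m: (w, h) array standing in for M of Eq. 5 (hypothetical input).
    """
    if low <= lam <= high:
        g = lam / (1.0 + np.exp(-mask_m))   # Eq. 6: G = lambda * sigmoid(M)
        lam_hat = g.mean()                  # Eq. 7: mean value of the mask
        g3 = g[..., None]                   # broadcast the mask over channels
        x_hat = g3 * x_i + (1.0 - g3) * x_j                 # Eq. 8, sample
        y_hat = lam_hat * y_i + (1.0 - lam_hat) * y_j       # Eq. 8, label
    else:
        x_hat = lam * x_i + (1.0 - lam) * x_j               # Eq. 1, sample
        y_hat = lam * y_i + (1.0 - lam) * y_j               # Eq. 1, label
    return x_hat, y_hat

# Toy inputs; the mask values are random placeholders in [0, 1).
rng = np.random.default_rng(0)
x_a, x_b = rng.random((8, 8, 1)), rng.random((8, 8, 1))
y_a, y_b = np.eye(3)[0], np.eye(3)[1]
mask_m = rng.random((8, 8))
x_star, y_star = mix_pair(x_a, y_a, x_b, y_b, lam=0.5, mask_m=mask_m)
x_van, y_van = mix_pair(x_a, y_a, x_b, y_b, lam=0.9, mask_m=mask_m)
```

With `lam=0.5` the StarMix branch is taken and the label weight comes from the mask average; with `lam=0.9` the call falls back to vanilla Mixup and the label weight is `lam` itself.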

4.2 LaKNet

This subsection presents a detailed description of the LaKNet modules and methods.

Architecture Specification.

LaKNet is composed of the following submodules:

Stem is the first module of the network and captures more details by mapping the original samples into a higher-dimensional space. Stem comprises a $Conv_{3\times 3}$ with stride=2 and channel=64. The result is then downsampled by a $Conv_{1\times 1}$ with stride=1 and a $DW$ $Conv_{3\times 3}$ (Depthwise Separable Convolution) with stride=2.

Embedding is a sub-module of the stage and comprises a $Conv_{1\times 1}$ and a BatchNorm layer. The inputs and outputs of this module are activated by the $ReLU$ function. The main purpose of Embedding is to further aggregate the feature information of the Stage's inputs in the convolution, to prevent the model from overfitting the training set, and to avoid exploding or vanishing gradients. This is a common practice for most network models.

LaKBlock comprises two branches: a $Conv$ module and a gating module. The $Conv$ module employs a combination of a $DW$ $Conv$ with a large kernel and a $Conv$ $set$ to extend the Effective Receptive Field by capturing both global and local features. The $Conv$ $set$ uses a small convolution kernel, $Conv_{5\times 5}$, to capture local features, and two dilated convolutions that aim to capture discrete features without additional overhead. The gating module employs a gating mechanism formed by the combination of $Conv_{1\times 1}$ and the $ReLU$ activation function to obtain effective features from the input $\mathcal{Z}$. Finally, we merge the output of the $Conv$ module and the output of the gating module into the total output, which is used as input to the Neck layer.

Neck connects consecutive stages in the architecture and uses a $Conv_{1\times 1}$ to increase the channel dimension of the input to the next stage.

FC Layer is a $Linear$ layer used for the final classification, where the obtained feature information is mapped to the final classification probability distribution.

To summarize, the channel dimensions of each stage are distinct, and the large kernel sizes differ between stages because of the downsampling operations. Consequently, the number of layers, channel dimensions, and large kernel sizes of each stage in LaKNet are $\{2, 2, 18, 2\}$, $\{128, 256, 512, 1024\}$, and $\{31, 29, 27, 13\}$, respectively.
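To give a feel for the cost of these kernel sizes, the sketch below counts the weights of one large-kernel convolution per stage. It assumes, as an illustration only, that the large-kernel depthwise convolution of each stage operates at that stage's channel dimension and has no bias; the helper names are hypothetical.

```python
# Stage settings quoted from the text above.
layers = [2, 2, 18, 2]
channels = [128, 256, 512, 1024]
kernels = [31, 29, 27, 13]

def depthwise_weights(c, k):
    """Weights of a k x k depthwise convolution over c channels (no bias)."""
    return c * k * k

def dense_weights(c, k):
    """Weights of a dense k x k convolution with c input and c output channels."""
    return c * c * k * k

for stage, (n, c, k) in enumerate(zip(layers, channels, kernels), start=1):
    dw = depthwise_weights(c, k)
    ratio = dense_weights(c, k) // dw  # a dense conv needs c times more weights
    print(f"stage {stage}: {n} blocks, {c} channels, {k}x{k} kernel, "
          f"{dw:,} depthwise weights per conv ({ratio}x fewer than dense)")
```

Because a depthwise kernel holds one $k \times k$ filter per channel, its weight count grows linearly with the channel dimension, whereas a dense convolution of the same size grows quadratically.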

Kernels mixing.

This paragraph explains how the large kernel function $Lak(\cdot)$ and the set of small kernel functions $Sak(\cdot)$ are mixed, so that the model can capture both local and global features. This allows for stable optimization of the model $f_{\theta}$:

$$
\mathcal{K} = Lak(\mathcal{Z}) + Sak(\mathcal{Z}), \tag{10}
$$

where $\mathcal{Z}$ denotes the embedding layer output, $\mathcal{K}$ is the sum of the outputs of $Lak(\cdot)$ and $Sak(\cdot)$, and $Sak(\cdot)$ combines a depthwise $Conv_{5\times 5}$ and two depthwise dilated $Conv_{3\times 3}$ with dilation set to 2 and 4.
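The spatial extent covered by each small kernel can be checked with the standard formula for dilated convolutions, $k_{\text{eff}} = d\,(k-1) + 1$. The sketch below applies it to the three kernels named above; the function and dictionary names are chosen for this example.

```python
def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation."""
    return dilation * (k - 1) + 1

def same_padding(k, dilation):
    """Padding that keeps the spatial size unchanged at stride 1 (odd extent)."""
    return (effective_kernel(k, dilation) - 1) // 2

small_set = {
    "depthwise 5x5": effective_kernel(5, 1),
    "depthwise 3x3, dilation 2": effective_kernel(3, 2),
    "depthwise 3x3, dilation 4": effective_kernel(3, 4),
}
```

The two dilated 3×3 kernels therefore span 5×5 and 9×9 windows while keeping only nine weights per channel each, and with the padding above every branch preserves the spatial size so that the outputs can be added as in Eq.10.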

Table 1: Top-1 accuracy (%) ↑ and EER (%) ↓ of the Baseline, MixUp, and StarMix methods on TJU600 based on different CNNs.

TJU600         Baseline        MixUp           StarMix
               Acc ↑   EER ↓   Acc ↑   EER ↓   Acc ↑   EER ↓
VGG16[33]      73.65   4.07    86.08   2.6     85.57   2.52
ResNet18[9]    89.73   1.2     93.90   0.81    94.67   0.71
ResNet50       85.68   1.96    92.87   0.95    94.13   0.93
FVRASNet[40]   71.02   4.19    87.38   2.11    88.50   2.01
FVCNN[5]       58.65   9.02    79.19   3.73    80.45   3.58
PVCNN[27]      62.57   6.06    86.38   2.46    86.92   2.36
Ours           91.90   1.06    96.43   0.51    96.63   0.44
Gain           +2.17   -0.14   +2.53   -0.30   +1.96   -0.27

Table 2: Top-1 accuracy (%) ↑ and EER (%) ↓ of the Baseline, MixUp, and StarMix methods on VERA220 based on different CNNs.

VERA220        Baseline        MixUp           StarMix
               Acc ↑   EER ↓   Acc ↑   EER ↓   Acc ↑   EER ↓
VGG16[33]      50.82   14.35   90.45   1.53    89.27   1.26
ResNet18[9]    65.82   7.86    94.82   1.12    95.73   0.61
ResNet50       56.18   9.74    87.82   2.25    91.73   1.56
FVRASNet[40]   61.73   7.74    83.36   2.36    89.09   2.28
FVCNN[5]       61.91   6.64    78.64   3.41    79.09   3.50
PVCNN[27]      50.82   11.42   89.55   1.8     89.73   1.19
Ours           85.55   1.81    97.52   0.51    98.09   0.35
Gain           +19.73  -5.93   +2.7    -0.61   +2.36   -0.26

Gating.

The gating operation aims to emphasize or suppress features by learning the weights of a $Conv_{1\times 1}$, combined with an activation function that introduces nonlinearity and completes the feature screening process. We select the faster ReLU over SiLU because the input $\mathcal{Z}$ is already activated by ReLU before it is fed into the LaKBlock, so the additional nonlinear features on the negative semi-axis are not needed. The gating operation is defined in Eq. 11:

$$gating(\mathcal{Z}) = ReLU(Conv_{1\times 1}(\mathcal{Z})). \tag{11}$$
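As a minimal illustration of Eq. 11, the sketch below implements a $1 \times 1$ convolution as a per-pixel linear map across channels, followed by ReLU. It uses plain Python lists rather than a deep learning framework; the toy input, the weights, and the presence of a bias term are assumptions made only for this example.

```python
def relu(x: float) -> float:
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def conv1x1(feature_map, weight, bias):
    """Pointwise convolution.

    feature_map: [C_in][H][W], weight: [C_out][C_in], bias: [C_out].
    Each output pixel is a linear combination of the input channels at that pixel.
    """
    c_in = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    out = []
    for w_row, b in zip(weight, bias):
        channel = [
            [b + sum(w_row[c] * feature_map[c][i][j] for c in range(c_in))
             for j in range(width)]
            for i in range(height)
        ]
        out.append(channel)
    return out


def gating(feature_map, weight, bias):
    """gating(Z) = ReLU(Conv1x1(Z)), following Eq. 11."""
    return [[[relu(v) for v in row] for row in channel]
            for channel in conv1x1(feature_map, weight, bias)]


# Toy input with 2 channels and a 1x2 spatial size (illustrative values).
z = [[[1.0, 2.0]], [[3.0, -1.0]]]
w = [[1.0, -1.0], [0.5, 0.5]]  # hypothetical 1x1 kernel weights
b = [0.0, 0.0]                 # hypothetical bias
print(gating(z, w, b))  # [[[0.0, 3.0]], [[2.0, 0.5]]]
```

In the first output channel, the response at the first pixel is negative before the activation and is therefore suppressed to zero, which is the screening effect described above.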

5 Experiments

To evaluate the efficacy of our method, we conducted experimental comparisons covering 6 benchmark models in 3 settings: baseline, MixUp, and StarMix. For fair validation, we compared against mainstream networks, i.e., VGG16[33], ResNet18[9], and ResNet50, as well as three classifiers designed for vein tasks: FVCNN[5], PVCNN[27], and FVRasNet[40]. We report the median Top-1 test accuracy over the last 10 epochs as the final result, because the classification performance of the model tends to stabilize in that state, which gives a better measure of the model's final performance and degree of overfitting. The False Acceptance Rate (FAR) and False Rejection Rate (FRR) of the last epoch are used to compute the Equal Error Rate (EER), which is the error rate at the threshold where FAR and FRR are equal. Lower EER values indicate better verification performance. The best and second-best results are marked in bold and cyan, respectively.
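The EER definition above can be sketched as a threshold sweep over matching scores. The code below is a minimal sketch, not the evaluation script used for the tables: it assumes that higher scores mean a better match, approximates the EER by the threshold where FAR and FRR are closest, and uses made-up genuine and impostor scores.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR when a comparison is accepted if score >= threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate EER.

    Scans every observed score as a candidate threshold, keeps the one where
    |FAR - FRR| is smallest, and returns (mean of FAR and FRR, threshold).
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical similarity scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor_scores = [0.7, 0.5, 0.3, 0.2, 0.1]

eer, thr = equal_error_rate(genuine_scores, impostor_scores)
print(eer, thr)  # 0.2 0.6
```

At the threshold 0.6, one of five impostor scores is accepted and one of five genuine scores is rejected, so FAR and FRR are both 20%.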

5.1 Dataset information

We chose two large public palm vein datasets for our experiments:

TJU600: The TJU palm vein dataset[46] consists of palm vein images of the left and right hands of 300 volunteers. The data from each volunteer were collected in 2 sessions, about 60 days apart. As 10 vein images were collected from each palm in each session, the dataset contains a total of 12,000 images (300 volunteers $\times$ 2 palms $\times$ 10 images $\times$ 2 time periods). Each image was scaled to a resolution of 224 $\times$ 224 after ROI normalization.

VERA220: The VERA palm vein dataset[34] contains palm vein images of the left and right hands of 110 volunteers, acquired over two time periods. As 5 palm vein images were collected from each hand in each period, the dataset contains a total of 2,200 images (110 volunteers $\times$ 2 palms $\times$ 5 images $\times$ 2 time periods). The resolution of each image was also set to 224 $\times$ 224.

5.2 Experiments implementation details

For TJU600, we divided the dataset into a training set and a test set, each consisting of 6,000 images (300 volunteers $\times$ 2 palms $\times$ 10 images). Similarly, we divided the VERA220 dataset into a training set and a test set, each consisting of 1,100 images (110 volunteers $\times$ 2 palms $\times$ 5 images). We use RandomFlip and RandomCrop with 3-pixel padding as the base augmentation methods. The batch size was set to 32, and the models were trained for a total of 600 epochs. The learning rate was set to 0.1 (0.01 for VGG16) and adjusted by the cosine scheduler. FVCNN and PVCNN used the Adam[22] optimizer with a learning rate of 0.0001 and a weight decay of 0.001; the other models were trained with the SGD[21] optimizer with a momentum of 0.9 and a weight decay of 0.0001. For MixUp and StarMix, the mixing ratio is drawn from $Beta(\alpha,\alpha)$ with $\alpha = 1$.
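The role of $Beta(\alpha,\alpha)$ with $\alpha = 1$ can be illustrated with vanilla MixUp, where a single ratio blends two samples and their labels. The sketch below covers only that vanilla case on flattened toy inputs; StarMix's mask generation is not reproduced here, and the function name and inputs are hypothetical.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=1.0, rng=random):
    """Vanilla MixUp on two flattened images and their one-hot labels.

    The ratio lam is drawn from Beta(alpha, alpha); with alpha = 1 this is
    the uniform distribution on [0, 1].
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


rng = random.Random(0)  # fixed seed so the example is repeatable
x_mix, y_mix, lam = mixup_pair(
    [0.0, 1.0], [1.0, 0.0],   # toy sample i and its one-hot label
    [1.0, 1.0], [0.0, 1.0],   # toy sample j and its one-hot label
    alpha=1.0, rng=rng,
)
print(lam, x_mix, y_mix)
```

The mixed label stays a valid distribution because its two entries are `lam` and `1 - lam`.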

5.3 Classification results

Table 1 and Table 2 report the performance of StarMix and LaKNet in terms of two metrics: classification accuracy and EER. Table 1 shows that our method outperforms the existing methods on TJU600 in classification performance. StarMix achieves good gains over the baseline across the different models; in particular, ResNet50 and FVCNN reach 94.13% and 80.45%, outperforming vanilla MixUp by +1.26% in both cases. Compared with the other models, StarLKNet achieves 91.90% accuracy without augmentation and 96.63% with StarMix. The results of our method on VERA220 are shown in Table 2. Because the VERA220 dataset is small, the models perform poorly without augmentation, and performance improves significantly once MixUp or StarMix is used. VGG16, for instance, improves by +39.63% with MixUp (from 50.82% to 90.45%), and PVCNN improves by +38.91% with StarMix (from 50.82% to 89.73%). Even without augmentation, StarLKNet still obtains a good result (85.55%), which demonstrates its ability to extract features effectively.

5.4 Robustness

5.4.1 Equal Error Rate

In the biometric identification task, positive and negative samples are usually highly imbalanced, and there is a clear trade-off relationship between FAR and FRR. Consequently, identification technology necessitates rigorous criteria for model performance evaluation. The conventional single index of accuracy is insufficient to fully reflect model performance and stability. The ROC (Receiver Operating Characteristic) curve and the EER offer a comprehensive evaluation of a biometric model’s performance in practical applications.

In the four subfigures of Figure 5, we observe that our method achieves the best ROC curves both with and without StarMix, and that StarMix brings a significant gain to the models.

Figure 5: (a) ROC curves of different models on TJU600; (b) ROC curves of different models using StarMix on TJU600; (c) ROC curves of different models on VERA220; (d) ROC curves of different models using StarMix on VERA220.

5.4.2 Occlusion

To analyze the robustness of StarLKNet to random occlusions [25], we built test sets from TJU600 and VERA220 that are occluded by random 16 $\times$ 16 patches, with the occlusion ratio gradually increasing from 0% to 10%. We fed the occluded test sets into the different models for classification. Since the features of vein images are continuous and discrete, a mask patch that falls in a region with continuous veins may interrupt those features. This is challenging for the recognition model and offers an additional configuration for assessing the model's robustness with respect to vein features.
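One way to build such an occluded test image is sketched below. The exact patch placement is not specified in the text, so this sketch assumes non-overlapping patches aligned to a 16 $\times$ 16 grid, an occlusion ratio measured as a fraction of grid patches, and zero as the fill value; the function name `occlude` is illustrative.

```python
import random


def occlude(image, ratio, patch=16, fill=0, rng=random):
    """Return (occluded copy of `image`, number of occluded patches).

    `image` is a list of rows. ASSUMPTION: patches lie on a regular grid and
    `ratio` is the fraction of grid patches that are filled with `fill`.
    """
    height, width = len(image), len(image[0])
    cells = [(r, c)
             for r in range(0, height - patch + 1, patch)
             for c in range(0, width - patch + 1, patch)]
    n_drop = round(ratio * len(cells))
    out = [row[:] for row in image]  # copy so the input stays untouched
    for r, c in rng.sample(cells, n_drop):
        for i in range(r, r + patch):
            out[i][c:c + patch] = [fill] * patch
    return out, n_drop


# A 224 x 224 image gives a 14 x 14 grid, i.e. 196 candidate patches.
image = [[1] * 224 for _ in range(224)]
occluded, n_drop = occlude(image, ratio=0.10, rng=random.Random(0))
print(n_drop)  # 20
```

At a 10% ratio, 20 of the 196 patches are masked, so a sizeable share of the vein lines is likely to be cut at least once.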

From Figure 6, we observe that StarLKNet achieves the highest accuracy for occlusion ratios from 0% to 8%, which shows that a large kernel is beneficial for robustness. Compared with the baseline, StarMix is also shown to be beneficial for robustness.

Figure 6: (a) Occlusion performance of the models on TJU600 with and without StarMix; StarMix handles occlusion better and yields better robustness. (b) Occlusion performance of the different models on VERA220.

5.5 Ablation Study

Our ablation experiments aim at analyzing the effectiveness of the StarMix augmentation method, the Gating module, and the Large Kernel module in LaKNet.

  • We perform experiments using ResNet18 and FVRasNet on the VERA220 dataset to verify the effect of StarMix as well as of the [0.3, 0.7] threshold setting. Table 3 shows that applying StarMix without the threshold can be detrimental: on ResNet18, Top-1 accuracy drops by 0.1% compared with MixUp. As illustrated in Figure 3, the mask concentrates on the central region when $\lambda$ is less than 0.3 and on the periphery when $\lambda$ is greater than 0.7. When $\lambda$ is between 0.3 and 0.7, however, the mask achieves an additional boost, outperforming MixUp by +0.91% and +5.73% on ResNet18 and FVRasNet, respectively.

  • In Table 4, we perform ablation experiments on each module of LaKNet to verify its effect. On VERA220, the $Conv\ set$ provides a +0.36% improvement over LaKNet with only a $Conv_{K\times K}$ and a $Conv_{5\times 5}$, and the gating operation improves classification performance by a further +0.74%. This is reasonable because the $Conv\ set$ expands the Effective Receptive Field using dilated convolutions, which allows the model to capture more global features, while the gating module filters the features and thereby retains the valid high-dimensional features.

Table 3: Ablation experiments on StarMix and the threshold setting on VERA220 with ResNet18 and FVRasNet.

| VERA220 | ResNet18 | FVRasNet |
|---|---|---|
| Baseline | 65.82 | 61.73 |
| Mixup | 94.82 | 83.36 |
| StarMix | 94.72 | 87.37 |
| w/ threshold | 95.73 | 89.09 |
| Gain | +1.01 | +1.72 |

Table 4: Ablation experiments on the LaKNet baseline and the effectiveness of its modules on the TJU600 and VERA220 datasets.

| Modules | TJU600 | VERA220 |
|---|---|---|
| ResNet18 | 89.73 | 65.82 |
| LaKNet | 89.78 | 84.45 |
| + $Conv\ set$ | 90.12 | 84.81 |
| + Gating | 91.90 | 85.55 |
| Gain | +1.78 | +0.74 |

6 Conclusion

In this paper, we have proposed StarLKNet, a convolution-based palm vein identification network with large kernels that incorporates the StarMix data augmentation method and the LaKNet structure. StarMix generates masks for image mixing through Gaussian functions, which effectively reduces the overfitting caused by insufficient samples in the dataset and also improves the robustness of the model. LaKNet captures more globally effective features through a large Effective Receptive Field combined with the screening capability of the gating mechanism, thus stabilizing the recognition capability of the model. A series of classification and analysis experiments have been conducted to validate the performance of our method.

Future work.

Our future work will further explore how to increase the size of the Effective Receptive Field and the extent to which it affects model performance. We will also investigate methods to reduce the time overhead caused by large convolutional kernels. Concerning the mixup method, we will attempt to develop an end-to-end approach.

References

  • Chou et al. [2020] H.-P. Chou, S.-C. Chang, J.-Y. Pan, W. Wei, and D.-C. Juan. Remix: rebalanced mixup. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 95–110. Springer, 2020.
  • Chugh et al. [2018] T. Chugh, K. Cao, and A. K. Jain. Fingerprint spoof buster: Use of minutiae-centered patches. IEEE Transactions on Information Forensics and Security, 13(9):2190–2202, 2018.
  • Cola et al. [2016] G. Cola, M. Avvenuti, F. Musso, and A. Vecchio. Gait-based authentication using a wrist-worn device. In Proceedings of the 13th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pages 208–217, 2016.
  • da Wu and Liu [2011] J. da Wu and C.-T. Liu. Finger-vein pattern identification using svm and neural network technique. Expert Syst. Appl., 38:14284–14289, 2011.
  • Das et al. [2018] R. Das, E. Piciucco, E. Maiorana, and P. Campisi. Convolutional neural network for finger-vein-based biometric identification. IEEE Transactions on Information Forensics and Security, 14(2):360–373, 2018.
  • Ding et al. [2022] X. Ding, X. Zhang, J. Han, and G. Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11963–11975, 2022.
  • Ding et al. [2023] X. Ding, Y. Zhang, Y. Ge, S. Zhao, L. Song, X. Yue, and Y. Shan. Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition. ArXiv, abs/2311.15599, 2023.
  • Dosovitskiy et al. [2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • Huang et al. [2010] B. Huang, Y. Dai, R. Li, D. Tang, and W. Li. Finger-vein authentication based on wide line detector and pattern normalization. 2010 20th International Conference on Pattern Recognition, pages 1269–1272, 2010.
  • Kang et al. [2019] W. Kang, Y. Lu, D. Li, and W. Jia. From noise to feature: Exploiting intensity distribution as a novel soft biometric trait for finger vein recognition. IEEE Transactions on Information Forensics and Security, 14:858–869, 2019.
  • Kim et al. [2020a] J. Kim, W. Choo, H. Jeong, and H. O. Song. Co-mixup: Saliency guided joint mixup with supermodular diversity. In International Conference on Learning Representations, 2020a.
  • Kim et al. [2020b] J.-H. Kim, W. Choo, and H. O. Song. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning, pages 5275–5285. PMLR, 2020b.
  • Li et al. [2021] S. Li, Z. Liu, Z. Wang, D. Wu, Z. Liu, and S. Z. Li. Boosting discriminative visual representation learning with scenario-agnostic mixup. ArXiv, abs/2111.15454, 2021.
  • Li et al. [2024] S. Li, Z. Wang, Z. Liu, C. Tan, H. Lin, D. Wu, Z. Chen, J. Zheng, and S. Z. Li. Moganet: Multi-order gated aggregation network. In International Conference on Learning Representations, 2024.
  • Liu et al. [2022a] S. Liu, T. Chen, X. Chen, X. Chen, Q. Xiao, B. Wu, M. Pechenizkiy, D. C. Mocanu, and Z. Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. ArXiv, abs/2207.03620, 2022a.
  • Liu et al. [2017] W. Liu, W. Li, L. Sun, L. Zhang, and P. Chen. Finger vein recognition based on deep learning. 2017 12th IEEE Conference on Industrial Electronics and Applications (ICIEA), pages 205–210, 2017.
  • Liu et al. [2018] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang. Exploring disentangled feature representation beyond face identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2080–2089, 2018.
  • Liu et al. [2022b] Z. Liu, S. Li, D. Wu, Z. Liu, Z. Chen, L. Wu, and S. Z. Li. Automix: Unveiling the power of mixup for stronger classifiers. In European Conference on Computer Vision, pages 441–458. Springer, 2022b.
  • Liu et al. [2023] Z. Liu, S. Li, G. Wang, L. Wu, C. Tan, and S. Z. Li. Harnessing hard mixed samples with decoupled regularizer. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Loshchilov and Hutter [2016] I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • Loshchilov and Hutter [2019] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  • Mathur and Matarić [2020] L. Mathur and M. J. Matarić. Introducing representations of facial affect in automated multimodal deception detection. Proceedings of the 2020 International Conference on Multimodal Interaction, 2020.
  • Miura et al. [2007] N. Miura, A. Nagasaka, and T. Miyatake. Extraction of finger-vein patterns using maximum curvature points in image profiles. IEICE Trans. Inf. Syst., 90-D:1185–1194, 2007.
  • Naseer et al. [2021] M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. S. Khan, and M.-H. Yang. Intriguing properties of vision transformers. In Neural Information Processing Systems, 2021.
  • Olsson et al. [2021] V. Olsson, W. Tranheden, J. Pinto, and L. Svensson. Classmix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1369–1378, 2021.
  • Qin et al. [2021] H. Qin, M. A. El-Yacoubi, Y. Li, and C. Liu. Multi-scale and multi-direction gan for cnn-based single palm-vein identification. IEEE Transactions on Information Forensics and Security, 16:2652–2666, 2021.
  • Qin et al. [2023a] H. Qin, R. Hu, M. A. El-Yacoubi, Y. Li, and X. Gao. Local attention transformer-based full-view finger-vein identification. IEEE Transactions on Circuits and Systems for Video Technology, 33:2767–2782, 2023a.
  • Qin et al. [2023b] H. Qin, X. **, Y. Jiang, M. A. El-Yacoubi, and X. Gao. Adversarial automixup. arXiv preprint arXiv:2312.11954, 2023b.
  • Qin et al. [2024] H. Qin, H. Zhu, X. **, Q. Song, M. A. El-Yacoubi, and X. Gao. Emmixformer: Mix transformer for eye movement recognition. arXiv preprint arXiv:2401.04956, 2024.
  • Qin et al. [2020] J. Qin, J. Fang, Q. Zhang, W. Liu, X. Wang, and X. Wang. Resizemix: Mixing data with preserved object information and true labels. arXiv preprint arXiv:2012.11101, 2020.
  • Radzi et al. [2016] S. A. Radzi, M. Khalil-Hani, and R. Bakhteri. Finger-vein biometric identification using convolutional neural network. Turkish Journal of Electrical Engineering and Computer Sciences, 24:1863–1878, 2016.
  • Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Tome and Marcel [2015] P. Tome and S. Marcel. On the vulnerability of palm vein recognition to spoofing attacks. In 2015 International Conference on Biometrics (ICB), pages 319–325. IEEE, 2015.
  • Uddin et al. [2020] A. S. Uddin, M. S. Monira, W. Shin, T. Chung, and S.-H. Bae. Saliencymix: A saliency guided data augmentation strategy for better regularization. In International Conference on Learning Representations, 2020.
  • Verma et al. [2019] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447, 2019.
  • Wang et al. [2020] G. Wang, C. Sun, and A. Sowmya. Multi-weighted co-occurrence descriptor encoding for vein recognition. IEEE Transactions on Information Forensics and Security, 15:375–390, 2020.
  • Wu et al. [2023] Y. Wu, H. Liao, H. Zhu, X. **, S. Yang, and H. Qin. Adversarial contrastive learning based on image generation for palm vein recognition. In 2023 2nd International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP), pages 18–24. IEEE, 2023.
  • Yang et al. [2022] S. Yang, Y. Wu, X. **, M. El Yacoubi, and H. Qin. Cgan-da: A cross-modality domain adaptation model for hand-vein biometric-based authentication. Journal of Cyber-Physical-Social Intelligence, 1:3–12, 2022.
  • Yang et al. [2020] W. Yang, W. Luo, W. Kang, Z. Huang, and Q. Wu. Fvras-net: An embedded finger-vein recognition and antispoofing system using a unified cnn. IEEE Transactions on Instrumentation and Measurement, 69(11):8690–8701, 2020.
  • Yao et al. [2022] H. Yao, Y. Wang, L. Zhang, J. Y. Zou, and C. Finn. C-mixup: Improving generalization in regression. Advances in neural information processing systems, 35:3361–3376, 2022.
  • Yoo et al. [2020] J. Yoo, N. Ahn, and K.-A. Sohn. Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8375–8384, 2020.
  • Yu et al. [2023] W. Yu, P. Zhou, S. Yan, and X. Wang. Inceptionnext: When inception meets convnext. ArXiv, abs/2303.16900, 2023.
  • Yun et al. [2019] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), pages 6023–6032, 2019.
  • Zhang et al. [2018a] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018a.
  • Zhang et al. [2018b] L. Zhang, Z. Cheng, Y. Shen, and D. Wang. Palmprint and palmvein recognition based on dcnn and a new large-scale contactless palmvein dataset. Symmetry, 10(4):78, 2018b.
  • Zhou and Kumar [2011] Y. Zhou and A. Kumar. Human identification using palm-vein images. IEEE Transactions on Information Forensics and Security, 6(4):1259–1274, 2011.