Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies

Ivan Drokin
Deep Learning Researcher
Seath the Scaleless Research Group
Abstract

The emergence of Kolmogorov-Arnold Networks (KANs) has sparked significant interest and debate within the scientific community. This paper explores the application of KANs in the domain of computer vision (CV). We examine the convolutional version of KANs, considering various nonlinearity options beyond splines, such as wavelet transforms and a range of polynomials. We propose a parameter-efficient design for Kolmogorov-Arnold convolutional layers and a parameter-efficient finetuning algorithm for pre-trained KAN models, as well as KAN convolutional versions of self-attention and focal modulation layers. We provide empirical evaluations conducted on the MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, and HAM10000 datasets for image classification tasks. Additionally, we explore segmentation tasks, proposing U-Net-like architectures with KAN convolutions and achieving state-of-the-art results on the BUSI, GlaS, and CVC datasets. We summarize all of our findings in a preliminary design guide for KAN convolutional models for computer vision tasks. Furthermore, we investigate regularization techniques for KANs. All experimental code, implementations of convolutional layers and models, and weights pre-trained on ImageNet1k are available on GitHub: https://github.com/IvanDrokin/torch-conv-kan.

Keywords Convolutional Kolmogorov-Arnold Networks · Kolmogorov-Arnold Networks · Computer Vision

1 Introduction

The rapid evolution of deep learning architectures has significantly advanced the field of computer vision, particularly in tasks that require the analysis of complex spatial data. Convolutional Neural Networks (CNNs), initially proposed by LeCun et al. [1], have become a cornerstone in this domain due to their ability to efficiently process high-dimensional data arrays such as images. These networks typically employ linear transformations followed by activation functions in their convolutional layers to discern spatial relationships, thereby reducing the number of parameters needed to capture intricate patterns in visual data. Since 2012, following the success of AlexNet [2] in the ImageNet classification challenge, CNNs have dominated the field of computer vision until the emergence of Vision Transformers [3]. Innovations such as Residual Networks [4] and Densely Connected networks [5], along with numerous subsequent works, have significantly advanced the achievable quality of models based on convolutional layers, enabling the effective training of very large and deep networks.

In segmentation tasks, especially within the biomedical domain, CNNs have also become foundational with the advent of the U-Net [6] architecture, which has subsequently inspired a whole family of U-Net-like architectures for segmentation tasks.

Recent developments in deep learning have seen the integration of sophisticated mathematical theories into neural network architectures, enhancing their capability to handle complex data structures. One such innovation is the Kolmogorov-Arnold Network (KAN) [7], which leverages the Kolmogorov-Arnold theorem to incorporate splines into its architecture, offering a compelling alternative to traditional Multi-Layer Perceptrons (MLPs). Soon after the original work introducing KANs, modifications and improvements emerged that attempt to overcome various issues associated with the spline-based approach, namely computational overhead and a large number of trainable parameters.

In light of these advancements, this paper explores the adaptation of KANs to convolutional layers—a prevalent component in many CNN architectures used for image classification. Traditional CNNs, while effective, often rely on fixed activation functions and linear transformations, which can be enhanced through the flexibility and reduced parametric complexity offered by KANs. In this work, we explore various modifications of the original KAN model, propose efficient designs for convolutional KAN models, conduct empirical evaluations of the proposed approaches, and compare them with classical convolutional networks.

Our contributions can be summarized as follows:

  1. We present the Bottleneck Convolutional Kolmogorov-Arnold layer, which retains the properties of Kolmogorov-Arnold layers while significantly reducing memory requirements.
  2. We empirically investigate several modifications of convolutional KAN models in the context of image classification tasks, including studies on regularization and hyperparameter tuning, across multiple datasets.
  3. We introduce a parameter-efficient finetuning algorithm for Gram polynomials in Convolutional Kolmogorov-Arnold Networks, which substantially reduces the number of trainable parameters needed when adapting a pre-trained model to new tasks.
  4. We redesign U-Net-like models for segmentation tasks with Kolmogorov-Arnold convolutional layers instead of regular convolutions, demonstrating that such models achieve state-of-the-art results on three diverse biomedical datasets.
  5. We propose redesigns of Self-Attention and Focal Modulation layers based on Bottleneck Convolutional Kolmogorov-Arnold layers, which significantly improve the performance of classification models.
  6. Based on the results of our empirical studies, we formulate design principles for constructing successful computer vision models based on Bottleneck Convolutional Kolmogorov-Arnold layers.
  7. We provide the entire codebase on GitHub, and pre-trained weights on ImageNet are available on HuggingFace to accelerate research and ensure reproducibility.

Our work is structured as follows. In Section 2, we present an overview of relevant works on Kolmogorov-Arnold networks. In Section 3, we describe the Kolmogorov-Arnold convolutional layer, its bottleneck version, attention and focal modulation layers, regularization techniques for the Kolmogorov-Arnold convolutional layer, and introduce and describe the PEFT algorithm for polynomial variants of Kolmogorov-Arnold convolutional networks. In Section 4, we present experiments on image classification and segmentation tasks. In Section 5, we conduct an ablation study for the bottleneck Kolmogorov-Arnold convolutional layer. In Section 6, we summarize all experimental results into a design guide for constructing Kolmogorov-Arnold convolutional networks.

2 Related works

The application of the Kolmogorov-Arnold theorem in neural networks marks a significant theoretical integration that enhances the expressiveness and efficiency of neural models. The theorem, which provides a way to represent any multivariate continuous function as a composition of univariate functions and additions, has been adapted in the design of Kolmogorov-Arnold Networks (KANs). KANs differ from traditional Multi-Layer Perceptrons (MLPs) by replacing linear weight matrices with learnable splines, thus reducing the number of parameters required and potentially improving the generalization capabilities of the network.

Recent research has proposed several variations of KANs to address specific limitations and enhance their performance.

Fast KAN by Li et al. (2024) introduced an adaptation where B-splines are replaced by Radial Basis Functions (RBFs). This modification aims to reduce the computational overhead associated with splines. The work demonstrated that third-order B-splines used in traditional KANs could be effectively approximated by Gaussian radial basis functions, resulting in FastKAN—a faster implementation of KAN which also functions as an RBF network.

Wavelet-based KANs (Wav-KAN), as presented by Bozorgasl et al. (2024), incorporate wavelet functions into the KAN structure to enhance both interpretability and performance. Wav-KAN leverages the properties of wavelets to efficiently capture high-frequency and low-frequency components of input data, balancing accurate representation of data structures with robustness against overfitting. The implementation employs discrete wavelet transforms (DWT) for multiresolution analysis, which simplifies the computation process. Wav-KAN has demonstrated enhanced accuracy, faster training speeds, and increased robustness compared to traditional Spl-KAN and MLPs.

ReLU-KAN by Qiu et al. (2024) addresses the computational complexity of basis function calculations in traditional KANs by introducing a novel implementation that utilizes ReLU (Rectified Linear Unit) and point-wise multiplication. This approach optimizes the computation process for efficient CUDA computing, achieving a significant speedup (20x) over traditional KANs while maintaining stable training and superior fitting ability. ReLU-KAN preserves the "catastrophic forgetting avoidance" property of KANs, making it a practical choice for both inference and training within existing deep learning frameworks.

Polynomial-based variations of KANs have also been explored. Chebyshev KAN replaces B-splines with Chebyshev polynomials, which are known for their excellent function approximation capabilities and can be calculated recursively. This approach aims to improve the performance and intuitiveness of KANs. Meanwhile, Gram KAN leverages the simplicity of Gram polynomial transformations, characterized by their discrete nature. This discrete approach is particularly suited for handling discretized datasets like images and text data, offering a novel method for data processing in neural networks.

Kolmogorov Arnold Legendre Network (KAL-Net) [8] represents a novel architecture using Legendre polynomials to surpass traditional polynomial approximations like splines in KANs. KAL-Net utilizes Legendre polynomials up to a specific order for input normalization, capturing nonlinear relationships more efficiently. By leveraging caching mechanisms and recurrence relations, KAL-Net enhances computational efficiency. It employs SiLU (Sigmoid Linear Unit) activation functions and layer normalization to stabilize outputs and improve training stability. KAL-Net has demonstrated remarkable accuracy (97.8%) on the MNIST dataset and efficiency, with an average forward pass taking only 500 microseconds, highlighting its potential in handling complex image patterns and computational efficiency.

Convolutional Kolmogorov-Arnold Networks (Convolutional KANs) [9] were introduced as an alternative to standard Convolutional Neural Networks (CNNs) in a recent study. This approach integrates the non-linear activation functions from KANs into convolutional layers, creating a new layer type that maintains similar accuracy levels to traditional CNNs while using half the number of parameters. This significant reduction in parameters highlights a promising direction for optimizing neural network architectures.

In the domain of medical image segmentation, U-Net [6] has become a foundational architecture, utilizing an encoder-decoder structure to effectively capture image features. Variations such as U-Net++ [10] improve segmentation accuracy through nested structures that fuse multi-scale features. Beyond convolution-based methods, transformer-based models like Vision Transformer [3] and TransUNet [11] have shown effectiveness in incorporating global context into medical image segmentation.

$U^{2}$-Net [12], designed for salient object detection, employs a two-level nested U-structure to capture contextual information across different scales. This architecture increases depth without significantly increasing computational cost, offering competitive performance on various datasets.

Li et al. [13] investigate and redesign the established U-Net pipeline by integrating Kolmogorov-Arnold Network (KAN) layers into the intermediate tokenized representations, resulting in the U-KAN architecture.

3 Method

This section details the methods used in our study, focusing on Kolmogorov-Arnold convolutions and their various adaptations and enhancements. The subsections cover the following topics:

In Section 3.1, we provide a brief overview of Kolmogorov-Arnold convolutions, as initially presented in [9], including their formalization and various basis function options such as splines, Radial-Basis Functions, Wavelets, and polynomials. We also introduce the use of Gram polynomials for parameter-efficient fine-tuning.

Section 3.2 addresses the primary issue with Kolmogorov-Arnold Convolutions—the high number of parameters introduced by the basis functions. We propose a bottleneck version to mitigate this problem, involving a squeezing convolution before and an expanding convolution after applying the basis function. This design includes a mixture of experts for effective implementation.

In Section 3.3, we describe the construction of Self-KAGtention layers by substituting traditional convolutions with Kolmogorov-Arnold convolutional layers. Additionally, we introduce Focal KAGN Modulation, where all convolutional layers in the original focal modulation are replaced with Kolmogorov-Arnold convolutional layers.

Section 3.4 discusses various regularization techniques, including weight and activation penalties, dropout placements, and additive Gaussian noise injection. We explore the impact of these techniques on Kolmogorov-Arnold convolutional networks.

Finally, in Section 3.5, we introduce the parameter-efficient finetuning algorithm for polynomial variants of Kolmogorov-Arnold convolutional networks. We outline several options for parameter-efficient fine-tuning, aiming to reduce the number of trainable parameters while adapting pre-trained models to new tasks.

3.1 Kolmogorov-Arnold Convolutions

Kolmogorov-Arnold convolutions were presented in [9]; in this section, we briefly cover their formalization.

Kolmogorov-Arnold convolutions can be stated as follows: the kernel consists of a set of univariate non-linear functions. Suppose we have an input image $y \in R^{c\times n\times m}$, where $c$ is the number of channels, and $n, m$ are the height and width of the image, respectively. Then, KAN-based convolutions with kernel size $k$ can be defined as:

$$x_{ij}=\sum_{d=1}^{c}\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}\varphi_{a,b,d}(y_{d,i+a,j+b});\quad i=\overline{1,n-k+1},\ j=\overline{1,m-k+1}$$

Each $\varphi$ is a univariate non-linear learnable function with its own trainable parameters. In the original paper [7], the authors propose to use this form of the functions:

$$\varphi=w_{b}\cdot b(x)+\tilde{\varphi}(x),$$
$$\tilde{\varphi}(x)=w_{s}\cdot Spline(x),$$
$$b(x)=SiLU(x)=x/(1+e^{-x})$$

As in KANs, functions other than splines can be chosen as the basis function $\tilde{\varphi}(x)$: radial basis functions, wavelets, polynomials, etc.
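The definition above can be illustrated with a minimal pure-Python sketch. It is not the implementation from [9] or from the accompanying repository: the names `silu` and `kan_conv2d` are hypothetical, only a single output channel is computed, there is no padding or stride, and indices are zero-based.

```python
import math


def silu(x):
    """Residual activation b(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


def kan_conv2d(y, phi, k):
    """Single-output-channel KAN convolution without padding or stride.

    y   : input as nested lists with shape c x n x m
    phi : phi[d][a][b] is the univariate function applied to channel d
          at kernel offset (a, b); every entry may be a different function
    k   : kernel size
    Returns an (n - k + 1) x (m - k + 1) nested list.
    """
    c, n, m = len(y), len(y[0]), len(y[0][0])
    out = [[0.0] * (m - k + 1) for _ in range(n - k + 1)]
    for i in range(n - k + 1):
        for j in range(m - k + 1):
            total = 0.0
            for d in range(c):
                for a in range(k):
                    for b in range(k):
                        # Each kernel position has its own function.
                        total += phi[d][a][b](y[d][i + a][j + b])
            out[i][j] = total
    return out
```

If every `phi[d][a][b]` is a linear map such as `lambda v: w * v`, the layer reduces to an ordinary convolution with weight `w`. A KAN-style choice would be, for example, `lambda v: 0.5 * silu(v) + 0.1 * v ** 2`, where the quadratic term is only a placeholder for $\tilde{\varphi}$.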

Replacing splines with Gram polynomials was proposed in [14]. The Gram polynomial, or discrete Chebyshev polynomial, $t^{N}_{n}(x)$ is a polynomial of degree $n$ in $x$, for $n=0,1,2,\ldots,N-1$, constructed such that two polynomials of unequal degree are orthogonal with respect to the weight function $w(x)=\sum_{r=0}^{N-1}\delta(x-r)$, with $\delta(\cdot)$ being the Dirac delta function. That is, $\int_{-\infty}^{\infty}t^{N}_{n}(x)\,t^{N}_{m}(x)\,w(x)\,dx=0$ if $n\neq m$. When splines are replaced by Gram polynomials, KAN convolutions are defined as follows.

$$\varphi=w_{b}\cdot b(x)+\tilde{\varphi}(x),$$
$$\tilde{\varphi}(x)=\sum_{i=0}^{N+1}w_{i}\cdot t^{i}_{n}(x),$$
$$b(x)=SiLU(x)=x/(1+e^{-x})$$

On the one hand, this reformulation enables parameter-efficient fine-tuning of a pre-trained model; on the other hand, it reduces the number of trainable parameters.
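Because the weight function is a sum of Dirac deltas at $r=0,\ldots,N-1$, the orthogonality integral in the definition of the Gram polynomials reduces to a finite sum over those grid points. The sketch below illustrates only that definition: it builds unnormalized discrete orthogonal polynomials by Gram-Schmidt orthogonalization of the monomials and represents each polynomial by its values on the grid. The helper names are hypothetical, and this construction is an assumption chosen for clarity; practical implementations would more likely use a recurrence relation.

```python
from fractions import Fraction


def dot(u, v):
    """Discrete inner product: sum over the grid points r = 0..N-1."""
    return sum(a * b for a, b in zip(u, v))


def gram_basis_on_grid(N):
    """Values of t_0, ..., t_{N-1} on the grid 0..N-1 (unnormalized).

    Polynomial n is obtained by removing from x**n its projections onto
    all lower-degree polynomials (Gram-Schmidt), using exact fractions.
    """
    grid = [Fraction(r) for r in range(N)]
    basis = []
    for n in range(N):
        v = [r ** n for r in grid]
        for u in basis:
            coef = dot(v, u) / dot(u, u)
            v = [vi - coef * ui for vi, ui in zip(v, u)]
        basis.append(v)
    return basis


basis = gram_basis_on_grid(5)
# Every pair of polynomials with unequal degree is orthogonal on the grid.
all_orthogonal = all(
    dot(basis[n], basis[m]) == 0
    for n in range(5) for m in range(5) if n != m
)
```

For $N=5$ the degree-1 polynomial is simply the grid shifted by its mean, which gives an easy sanity check on the construction.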

3.2 Bottleneck Kolmogorov-Arnold Convolutions

Figure 1: KAN Convolution (left) and Bottleneck KAN Convolution (right). The main difference between these two types of layers is the pair of encoder-decoder convolutional layers in the right data stream.

The main problem with KAN Convolutions lies in the $Spline$ part of the model. Whatever type of basis function one chooses, the basis introduces a lot of parameters to the model, which leads to higher resource requirements during training and increases the probability of overfitting. To overcome those issues, we propose to use Bottleneck Kolmogorov-Arnold Convolutions (see Fig. 1).

We propose to apply a squeezing convolution with kernel size 1 to the input before applying the basis functions, and an expanding convolution with kernel size 1 afterwards. Intuitively, this can be viewed as a one-layer encoder that helps extract meaningful features from the input before processing it via the chosen basis, followed by a one-layer decoder that decodes it back. The residual activation helps to preserve necessary details that could be lost during encoding and decoding of the input.

Figure 2: Bottleneck Kolmogorov-Arnold Convolutional Mixture of Experts. The router and experts are placed between bottleneck convolutions, and each expert is a set $\tilde{\varphi}$ of univariate functions. We use sparsely-gated mixture-of-experts [15].

Figure 3: Possible dropout layer placement inside the KAGN Convolution layer: Full - before the layer, Poly - before computing the Gram basis, and Degree - before the weighted sum of the previously computed basis.

We compare the number of trainable parameters across various layer types (see Fig. 4). The reduction parameter in the Bottleneck KAGN Conv 2D layer controls the ratio of $d_{x}$ to $d_{\varphi}$: $reduction=d_{x}/d_{\varphi}$. For a fixed $d_{x}$, a larger reduction means a smaller $d_{\varphi}$, so adjusting this parameter lets us trade model complexity against performance.

Such a design allows us to implement a mixture of experts effectively. We utilize sparsely-gated mixture-of-experts [15] in between the encoder and decoder convolutions, with a set of $\tilde{\varphi_{i}}(x)$ as experts (see Fig. 2).
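The routing step can be sketched as follows. This is a simplified, noise-free version of top-k gating in the spirit of [15], written for a single feature vector rather than a feature map. The function names and the linear router are hypothetical; in the layer described above, each expert would be a set of univariate functions applied between the squeezing and expanding convolutions.

```python
import math


def top_k_gates(logits, k):
    """Softmax over the k largest router logits; other experts get no weight."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


def moe_forward(x, router_weights, experts, k=2):
    """Route the vector x to the top-k experts and mix their outputs.

    router_weights : one weight row per expert (assumed linear router)
    experts        : list of functions mapping a vector to a vector
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = top_k_gates(logits, k)
    out = [0.0] * len(x)
    for i, gate in gates.items():
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, gates
```

Only the selected experts are evaluated, which is what keeps the cost of adding experts low relative to evaluating all of them.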

Figure 4: Comparison of the number of trainable parameters for different types of layers. The reduction parameter for Bottleneck KAGN Conv 2D is responsible for the ratio of $d_{x}$ to $d_{\varphi}$.
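A rough count makes the effect of the reduction parameter concrete. The formulas below are assumptions for illustration, not figures from the paper: biases, normalization layers, and any basis-specific parameters are ignored, each polynomial degree from 0 to `degree` is assumed to carry its own full convolution kernel, and the bottleneck layer is assumed to keep $d_{\varphi}$ channels on both sides of the basis.

```python
def conv_params(c_in, c_out, k):
    """Weights of a plain k x k convolution (no bias)."""
    return c_in * c_out * k * k


def kagn_conv_params(c_in, c_out, k, degree):
    """Residual convolution plus one kernel per polynomial degree 0..degree."""
    return conv_params(c_in, c_out, k) + (degree + 1) * c_in * c_out * k * k


def bottleneck_kagn_conv_params(c_in, c_out, k, degree, reduction):
    """Same layer with 1x1 squeeze and expand convolutions around the basis."""
    d_phi = max(c_in // reduction, 1)       # reduction = d_x / d_phi
    squeeze = c_in * d_phi                  # 1x1 encoder
    basis = (degree + 1) * d_phi * d_phi * k * k
    expand = d_phi * c_out                  # 1x1 decoder
    residual = conv_params(c_in, c_out, k)  # residual activation branch
    return squeeze + basis + expand + residual


plain = conv_params(64, 64, 3)
full = kagn_conv_params(64, 64, 3, degree=3)
slim = bottleneck_kagn_conv_params(64, 64, 3, degree=3, reduction=4)
```

Under these assumptions the basis term scales with the square of $d_{\varphi}$, so it shrinks quadratically as the reduction grows, while the residual branch is unaffected.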

3.3 Kolmogorov-Arnold Self-Attention and Focal Modulation

In [16], a self-attention layer was introduced in which the K, Q, and V projections are computed by convolutional layers instead of linear projections. In this paper, we propose a similar approach to construct Self-KAGtention layers by substituting traditional convolutions with KAN convolutional layers. Given that the self-attention operation requires $O(n^{2})$ memory, where $n$ is the number of tokens or tensor pixels in convolutional models, we suggest using optional bottleneck convolutions with a $1\times 1$ kernel and placing the self-attention layer between these two convolutions. We call this version Self-KAGNtention.

Yang et al. [17] introduce the Focal Modulation layer to serve as a seamless replacement for the Self-Attention layer. The layer boasts high interpretability, making it a valuable tool for deep learning practitioners. Here we propose Focal KAGN Modulation, where all convolutional layers from the original focal modulation are replaced with KAN convolutional layers. Because the Focal Modulation layer uses grouped convolutions in the hierarchical contextualization stream, with the number of groups equal to the number of filters, we note that in this case Bottleneck KAN Convolutions should be replaced by KAN Convolutions.

3.4 Regularizations in Kolmogorov-Arnold Convolutional Networks

Applying regularization techniques to Kolmogorov-Arnold convolutional layers involves straightforward weight and activation penalties. However, dropout requires careful consideration of its placement. Consider the polynomial version of a Kolmogorov-Arnold convolutional layer: instead of computing splines over $x$, we can use Gram, Chebyshev, Legendre, or other polynomials. In this case, we first compute the polynomial basis over $x$ and then perform a weighted sum of the basis values (Fig. 3). This gives three possible options for dropout placement: before the layer (we will refer to this position as "Full"), before the polynomial basis calculation (we will refer to this position as "Poly"), and before the weights are applied to the polynomials (we will refer to this position as "Degree").
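The three placements can be made concrete with a scalar sketch. It is an illustration under stated assumptions rather than the layer used in the experiments: a Chebyshev recurrence stands in for the Gram basis, the unit operates on a single number instead of a feature map, and the names `poly_kan_unit`, `chebyshev_basis`, and `dropout` are hypothetical.

```python
import math
import random


def silu(x):
    return x / (1.0 + math.exp(-x))


def chebyshev_basis(x, degree):
    """T_0..T_degree at x via T_n = 2x T_{n-1} - T_{n-2} (stand-in basis)."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if p <= 0.0:
        return list(values)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]


def poly_kan_unit(x, w_b, w_poly, drop, placement):
    """Scalar polynomial KAN unit with a configurable dropout position.

    drop      : function mapping a list of floats to a list of floats
    placement : "Full", "Poly" or "Degree"
    """
    if placement == "Full":    # before the layer: both branches are affected
        x = drop([x])[0]
    # "Poly": only the input of the basis computation is dropped.
    x_poly = drop([x])[0] if placement == "Poly" else x
    basis = chebyshev_basis(x_poly, len(w_poly) - 1)
    if placement == "Degree":  # before the weighted sum of basis values
        basis = drop(basis)
    return w_b * silu(x) + sum(w * t for w, t in zip(w_poly, basis))
```

A typical call would pass `drop=lambda v: dropout(v, 0.25, random.Random(0))`. Note that with "Poly" and "Degree" the residual branch always sees the original input, whereas "Full" perturbs both branches.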

The authors of [7] state the benefits of KANs over MLPs, one of which is robustness to noise and adversarial attacks. From this observation, we can derive an alternative regularization method. Instead of zeroing out some fraction of neurons, we can add Gaussian noise to a layer's input, thus forcing the model to filter this noise and be more robust against noise in unseen data. Similar to dropout, there are three placements available: "Full", "Poly", and "Degree". More formally, for a given neuron $y_{i}=f(w_{i}\cdot x_{i})$, during noise injection, the neuron output is:

$$y_{i}^{\prime}=\begin{cases}y_{i}+\alpha\varepsilon_{i},\ \varepsilon_{i}\sim N(0,\sigma^{2}(y))&\text{with probability }p\\ y_{i}&\text{with probability }1-p\end{cases}$$

In this equation, $\alpha$ is a parameter that controls the amount of added noise, and $\sigma^{2}(y)$ is the variance of the input, computed for each input channel.
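A minimal sketch of this noise injection for a single channel is given below. The function name is hypothetical, and using the population standard deviation of the channel for $\sigma$ is an assumption; the text only states that the variance is computed per input channel.

```python
import random
import statistics


def inject_noise(channel, alpha, p, rng):
    """Add scaled Gaussian noise to the values of one channel.

    Each value receives alpha * eps with probability p, where
    eps ~ N(0, sigma^2) and sigma^2 is the variance of this channel.
    """
    sigma = statistics.pstdev(channel)
    return [
        v + alpha * rng.gauss(0.0, sigma) if rng.random() < p else v
        for v in channel
    ]
```

Because the noise scale follows the channel variance, a constant channel is left unchanged, and setting `alpha` or `p` to zero disables the regularizer.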

3.5 Parameter Efficient Finetuning

Let us assume we have a pre-trained model $L$ with Gram KAN convolution layers and we want to fine-tune this model on a downstream task, e.g. a classification dataset $D=\{x_{i},y_{i}\}_{i=1}^{T}$. Let $G_{i}(w^{i}_{b},w^{i}_{g})$ be the $i$-th layer of the model $L$, where $w^{i}_{b}$ are the weights of the residual activation and $w^{i}_{g}=\{w^{i}_{j}\}_{j=0}^{N+1}$ are the weights of the Gram polynomials (see the formulas in Section 3.1). Assume that model $L$ has $M$ Gram KAN convolutional layers.

Then we have several options for parameter-efficient fine-tuning for a downstream task.

  • Only $\tilde{w}_{g}=\{w^{i}_{j}\}_{j=R}^{N+1},\ R>1$ are trainable parameters, while all other parameters of $G_{i}(w^{i}_{b},w^{i}_{g})$ for all $i$ are frozen.

  • A new parameter $\tilde{w^{i}}_{N+2}$ could be introduced, and the Gram basis will be expanded by one (or more) polynomial of higher degree.

Then we can formalize the Parameter Efficient Finetuning algorithm for Gram KAN convolutional models as follows.

$$\mathcal{C}(L(x_{t}),y_{t})\xrightarrow[\{w^{i}_{j}\}_{j=R_{l}}^{R_{u}},\ i=1\dots M]{t=1\dots T}\min$$

Here $\mathcal{C}$ is a loss function, and $R_{l}$ and $R_{u}$ define the minimum and maximum degrees of Gram polynomials, respectively, at which the corresponding weights will be updated. When $R_{u}>N+1$, we add additional polynomials to the basis and initialize the corresponding weights with zeros.

These fine-tuning options substantially reduce the number of trainable parameters. Intuitively, one could consider this method as a refinement of high-order features to match the new data distribution. An empirical evaluation of this method is presented in Section 4.6.
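The selection of trainable weights by degree can be sketched for a single layer as follows. The helper names and the list-of-lists weight layout are hypothetical; the sketch only covers the polynomial weights, with the residual weights assumed frozen, and it shows the zero initialization used when the upper degree exceeds the size of the pre-trained basis.

```python
def prepare_gram_layer(poly_weights, r_lower, r_upper):
    """Expand the basis if needed and mark which degrees are trainable.

    poly_weights     : one weight list per polynomial degree (index = degree)
    r_lower, r_upper : inclusive range of degrees whose weights are updated
    Returns (weights, trainable), where trainable[j] is True for degree j
    in the range; all other degrees stay frozen.
    """
    weights = [list(w) for w in poly_weights]
    size = len(weights[0])
    while len(weights) <= r_upper:
        # New higher-degree polynomials start with zero weights, so the
        # output of the pre-trained layer is unchanged before fine-tuning.
        weights.append([0.0] * size)
    trainable = [r_lower <= j <= r_upper for j in range(len(weights))]
    return weights, trainable


def trainable_fraction(weights, trainable):
    """Share of polynomial weights that will receive gradient updates."""
    total = sum(len(w) for w in weights)
    active = sum(len(w) for w, t in zip(weights, trainable) if t)
    return active / total
```

In a training framework, the returned mask would typically be used to disable gradients for the frozen degrees; how this is done depends on the framework and is not shown here.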

4 Experiments

In this section, we present the following experiments. Section 4.1 includes experiments on the MNIST [18], CIFAR10, and CIFAR100 [19] datasets with different formalizations of KAN convolutions. Section 4.2 presents experiments with various regularizations and hyperparameter optimization. Section 4.3 provides results for Bottleneck convolutions on CIFAR100 and Tiny ImageNet [20] and also considers ResNet-like and DenseNet-like architectures. Section 4.4 presents results of self-attention layers experiments. Section 4.5 presents results for ImageNet1k dataset, and Section 4.6 presents results of parameter-efficient finetuning. Section 4.7 presents results for segmentation tasks.

4.1 Baseline on MNIST, CIFAR10 and CIFAR100

The baseline models are simple networks with 4 and 8 convolutional layers. To reduce the spatial dimensions of the feature maps, convolutions with dilation=2 were used. In the 4-layer model, the second and third convolutions had dilation=2, while in the 8-layer model, the second, third, and sixth convolutions had dilation=2.

The number of channels in the convolutions was the same for all models.

  • For 4 layers: 32, 64, 128, 512

  • For 8 layers: 2, 64, 128, 512, 1024, 1024, 1024, 1024

After the convolutions, Global Average Pooling was applied, followed by a linear output layer.

For classic convolutions, a traditional structure was used: convolution, batch normalization, ReLU. For KAN convolutions, batch normalization and SiLU are applied after the KAN convolution layer. All experiments were conducted on an NVIDIA RTX 3090 with identical training parameters.

In this section, we investigate the performance of several KAN options: spline-based (KANConv), RBF-based (FastKANConv), Legendre polynomials (KALNConv), Chebyshev polynomials (KACNConv), Gram polynomials (KAGNConv), and Wavelet KANs (WavKANConv).
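Two of the polynomial families named above can be evaluated with their standard three-term recurrences, as the short sketch below shows. It only computes basis values at a scalar point. It does not reproduce the KALNConv or KACNConv layers, and the function names are illustrative.

```python
def chebyshev_basis(x, degree):
    """Values T_0(x)..T_degree(x) via T_{n+1} = 2x T_n - T_{n-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def legendre_basis(x, degree):
    """Values P_0(x)..P_degree(x) via
    (n+1) P_{n+1} = (2n+1) x P_n - n P_{n-1}."""
    values = [1.0, x]
    for n in range(1, degree):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: degree + 1]


print(chebyshev_basis(0.5, 3))
print(legendre_basis(0.5, 3))
```

A polynomial KAN layer then applies learned weights to such basis values, so the number of weights grows with the chosen degree.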

| Model | MNIST: Val. Accuracy | MNIST: Params., M | MNIST: Eval Time, s | CIFAR10: Val. Accuracy | CIFAR10: Params., M | CIFAR10: Eval Time, s | CIFAR100: Val. Accuracy | CIFAR100: Params., M | CIFAR100: Eval Time, s |
|---|---|---|---|---|---|---|---|---|---|
| Conv, 4 layers, baseline | 99.42 | 0.1 | 0.7008 | 73.18 | 0.1 | 1.8321 | 42.29 | 0.12 | 1.5994 |
| KANConv, 4 layers | 99.00 | 3.49 | 2.6401 | 52.08 | 3.49 | 3.7972 | 21.78 | 3.52 | 4.0262 |
| FastKANConv, 4 layers | 97.65 | 3.49 | 1.5999 | 64.95 | 3.49 | 2.3716 | 34.32 | 3.52 | 2.7457 |
| KALNConv, 4 layers | 84.85 | 1.94 | 1.7205 | 10.28 | 1.94 | 3.0527 | 5.97 | 1.97 | 3.0919 |
| KACNConv, 4 layers | 97.62 | 3.92 | 1.6710 | 52.01 | 3.92 | 2.3972 | 23.17 | 0.42 | 2.6522 |
| KAGNConv, 4 layers | 99.49 | 0.49 | 1.7253 | 65.84 | 0.49 | 2.2570 | 47.36 | 1.97 | 2.3399 |
| WavKANConv, 4 layers | 99.23 | 0.95 | 7.4622 | 73.63 | 0.95 | 11.2276 | 41.50 | 0.98 | 11.4744 |
| Conv, 8 layers, baseline | 99.63 | 1.14 | 1.2061 | 83.05 | 1.14 | 1.8258 | 57.52 | 1.19 | 1.8265 |
| KANConv, 8 layers | 99.37 | 40.7 | 4.2011 | 74.66 | 40.7 | 5.4858 | 36.18 | 40.74 | 5.7067 |
| FastKANConv, 8 layers | 99.49 | 40.7 | 2.1653 | 74.66 | 40.7 | 5.4858 | 43.32 | 40.74 | 2.7771 |
| KALNConv, 8 layers | 49.97 | 22.61 | 1.7815 | 15.97 | 22.61 | 2.7348 | 1.74 | 22.65 | 2.6863 |
| KACNConv, 8 layers | 99.32 | 18.09 | 1.6973 | 62.14 | 18.09 | 2.3459 | 25.01 | 18.14 | 2.3826 |
| KAGNConv, 8 layers | 99.68 | 22.61 | 2.2402 | 84.14 | 22.61 | 2.5849 | 59.27 | 22.66 | 2.6460 |
| WavKANConv, 8 layers | 99.57 | 10.73 | 59.1734 | 85.37 | 10.73 | 28.0385 | 55.43 | 10.78 | 30.5438 |

Table 1: Results on MNIST, CIFAR10, and CIFAR100 datasets

As Table 1 shows, the Gram polynomial-based and Wavelet-based versions perform better than the other KAN options and, in several configurations, outperform vanilla convolutions. Because wavelet-based KANs require more computational resources, we focus on Gram KANs as the main basis function option in further research.

4.2 Regularization study and hyperparameters optimization

4.2.1 Regularization study

The baseline model is a simple network with 8 convolutional layers and Gram polynomials as basis functions. To reduce the spatial dimensions of the feature maps, convolutions with dilation=2 were used: the second, third, and sixth convolutions had dilation=2.

We explore two sets of convolutional layer filters:

  • Slim: 16, 32, 64, 128, 256, 256, 512, 512

  • Wide: 32, 64, 128, 256, 512, 512, 1024, 1024

After the convolutions, Global Average Pooling was applied, followed by a linear output layer. In these experiments, no dropout was applied after Global Average Pooling.

All experiments were conducted on an NVIDIA RTX 3090 with identical training parameters. As part of this research, we aim to find answers to the following questions:

  • What is the impact of $L_1$ and $L_2$ activation penalties of ConvKAN layers on the model?

  • What is the impact of $L_1$ regularization of the weights of ConvKAN layers on the model?

  • How does the dropout placement within a ConvKAN layer impact the model? Essentially, there are three placements: before the layer (Full), before the calculation of the polynomial basis (Poly), and before applying the weights to the calculated polynomials (Degree).

  • Since KAN models are supposed to be more robust to noise in the data, can we replace the Dropout layer with additive Gaussian noise as a regularization technique?
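The three placements can be illustrated on a toy scalar unit. The sketch below is not the ConvKAN layer: monomials stand in for the Gram basis, the residual term is a plain linear one, and the noise parameterization (zero-mean Gaussian with standard deviation `p`) is an assumption chosen only for this example.

```python
import random


def noise_or_dropout(values, p, mode, rng):
    """Apply inverted dropout or additive Gaussian noise to a list of floats."""
    if p == 0.0:
        return list(values)
    if mode == "dropout":
        return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]
    # "noise": additive zero-mean Gaussian noise with standard deviation p
    # (an assumed parameterization, not taken from the paper).
    return [v + rng.gauss(0.0, p) for v in values]


def toy_kan_unit(x, residual_w, basis_w, placement=None, p=0.0,
                 mode="dropout", rng=None):
    """One scalar KAN-style unit: residual term plus weighted polynomial basis."""
    rng = rng or random.Random(0)
    if placement == "full":      # before the whole layer
        x = noise_or_dropout([x], p, mode, rng)[0]
    poly_in = x
    if placement == "poly":      # only before the polynomial basis
        poly_in = noise_or_dropout([x], p, mode, rng)[0]
    # Monomials stand in for the Gram polynomial basis here.
    basis = [poly_in ** d for d in range(len(basis_w))]
    if placement == "degree":    # on the computed basis, before the weights
        basis = noise_or_dropout(basis, p, mode, rng)
    return residual_w * x + sum(w * b for w, b in zip(basis_w, basis))


print(toy_kan_unit(0.5, 1.0, [0.1, 0.2, 0.3]))
print(toy_kan_unit(0.5, 1.0, [0.1, 0.2, 0.3], placement="degree",
                   p=0.1, mode="noise", rng=random.Random(1)))
```

The sketch makes the difference visible: Full perturbs both the residual path and the basis, Poly perturbs only the basis input, and Degree perturbs individual basis terms.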

| Regularization | Slim: Train Accuracy | Slim: Val. Accuracy | Wide: Train Accuracy | Wide: Val. Accuracy |
|---|---|---|---|---|
| None | 73.73 | 61.30 | 86.88 | 67.04 |
| $L_1$ Activation, 1e-08 | 74.49 | 61.20 | 86.69 | 66.41 |
| $L_1$ Activation, 1e-07 | 74.64 | 61.41 | 86.60 | 66.88 |
| $L_1$ Activation, 1e-06 | 74.11 | 61.74 | 86.73 | 67.09 |
| $L_2$ Activation, 1e-08 | 73.85 | 61.36 | 86.61 | 67.49 |
| $L_2$ Activation, 1e-07 | 73.96 | 61.55 | 86.78 | 66.81 |
| $L_2$ Activation, 1e-06 | 73.10 | 60.29 | 86.99 | 66.30 |
| $L_1$ Weight, 1e-08 | 74.31 | 61.39 | 86.41 | 66.79 |
| $L_1$ Weight, 1e-07 | 73.48 | 61.08 | 86.41 | 66.79 |
| $L_1$ Weight, 1e-06 | 73.60 | 60.96 | 87.10 | 66.83 |
| Poly Dropout, 0.05 | 68.69 | 60.98 | 84.97 | 66.83 |
| Full Dropout, 0.05 | 60.98 | 63.64 | 79.53 | 68.93 |
| Degree Dropout, 0.05 | 70.12 | 62.91 | 85.01 | 67.75 |
| Poly Dropout, 0.15 | 65.43 | 58.63 | 82.07 | 62.26 |
| Full Dropout, 0.15 | 43.77 | 56.10 | 63.19 | 66.38 |
| Degree Dropout, 0.15 | 64.37 | 62.28 | 81.24 | 68.70 |
| Poly Dropout, 0.25 | 61.93 | 51.27 | 80.22 | 59.70 |
| Full Dropout, 0.25 | 32.99 | 47.65 | 49.49 | 59.56 |
| Degree Dropout, 0.25 | 60.09 | 60.69 | 77.46 | 67.85 |
| Poly Noise Injection, 0.05 | 70.42 | 62.06 | 84.65 | 67.11 |
| Full Noise Injection, 0.05 | 59.75 | 62.63 | 79.71 | 69.18 |
| Degree Noise Injection, 0.05 | 69.45 | 62.40 | 85.00 | 68.43 |
| Poly Noise Injection, 0.15 | 65.75 | 57.64 | 82.78 | 63.82 |
| Full Noise Injection, 0.15 | 43.78 | 56.15 | 63.30 | 66.07 |
| Degree Noise Injection, 0.15 | 64.52 | 61.78 | 81.33 | 68.23 |
| Poly Noise Injection, 0.25 | 63.38 | 53.14 | 80.74 | 59.96 |
| Full Noise Injection, 0.25 | 33.54 | 48.24 | 49.58 | 59.70 |
| Degree Noise Injection, 0.25 | 60.17 | 60.98 | 77.26 | 67.82 |

Table 2: Regularization study on CIFAR10 dataset

Based on the conducted experiments and results presented in Table 2, the following conclusions can be drawn:

  • Full Dropout and Full Noise Injection appear to be the best options for regularization, helping to combat model overfitting.

  • $L_1$/$L_2$ activation penalties, as well as the $L_1$ weight penalty, yield only marginal improvements.

  • In a wider model, the impact of regularization is greater.

4.2.2 Scaling KANs

In classical convolutional networks, we have two major options for scaling up models: we can go deeper and stack more layers, or we can go wider and expand the number of convolutional filters. There are other ways to scale up models, like leveraging a mixture of experts.

KAN Convs with Gram polynomials as basis functions provide us with another possibility for scaling: instead of inflating channel numbers or adding new layers, we could increase the degree of polynomials.

During the experiments, we used the same augmentations as in the previous section, NoiseInjection in Full positions for regularization with p = 0.05 and linear dropout with p = 0.05.

The baseline models are simple networks with 8, 12, and 16 convolutional layers with Gram polynomials as basis functions. To reduce the spatial dimensions of the feature maps, convolutions with dilation=2 were used: the second, third, and sixth convolutions for the 8- and 12-layer models, and the second, fourth, and eighth convolutions for the 16-layer models.

The models have the following sets of convolutional layer filters with a width scale equal to 1:

  • 8 layers: 16, 32, 64, 128, 256, 256, 512, 512

  • 12 layers: 16, 32, 64, 128, 256, 256, 256, 256, 256, 512, 512, 512

  • 16 layers: 16, 16, 32, 32, 64, 64, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512
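To make the three scaling axes concrete, the sketch below applies a width scale to the filter lists above and counts kernel weights under a deliberately simple model. The assumption that each layer holds `basis_terms` kernels of shape (c_out, c_in, k, k) is illustrative only and is not the exact parameter count of a KAGN layer; the 3-channel input and 3x3 kernel defaults are also assumptions.

```python
BASE_FILTERS = {
    8: [16, 32, 64, 128, 256, 256, 512, 512],
    12: [16, 32, 64, 128, 256, 256, 256, 256, 256, 512, 512, 512],
    16: [16, 16, 32, 32, 64, 64, 128, 128,
         256, 256, 256, 256, 512, 512, 512, 512],
}


def scaled_filters(depth, width_scale):
    """Multiply every filter count of the chosen depth by the width scale."""
    return [c * width_scale for c in BASE_FILTERS[depth]]


def relative_weight_count(depth, width_scale, basis_terms,
                          in_channels=3, kernel=3):
    """Count kernel weights, assuming `basis_terms` kernels per layer
    (an illustrative model, not the true KAGN parameter count)."""
    total, c_in = 0, in_channels
    for c_out in scaled_filters(depth, width_scale):
        total += basis_terms * c_in * c_out * kernel * kernel
        c_in = c_out
    return total


base = relative_weight_count(8, 1, 1)
print(relative_weight_count(8, 2, 1) / base)  # width: close to quadratic growth
print(relative_weight_count(8, 1, 2) / base)  # basis terms: linear growth
```

Under this simplified model, doubling the width roughly quadruples the weight count, while adding basis terms grows it linearly, which is in line with the observation that parameters increase rapidly along every axis.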

Figure 5: Scaling experiments on CIFAR100 dataset: Accuracy vs. Model depth (left), Accuracy vs. Model degree (center), and Accuracy vs Model width (right).

Our findings indicate that scaling model width yields better performance than increasing model depth. Additionally, there is no observed benefit in scaling the Gram’s degree of the model. However, it is noteworthy that the number of trainable parameters increases rapidly in any configuration, whether depth, width, or degree is scaled. The lack of benefits from depth and degree scaling may be attributed to the relatively small dataset size. It is plausible that larger datasets could produce different outcomes.

4.2.3 Hyperparameters tuning

To identify an optimal set of hyperparameters and mitigate the risk of overfitting to the test set, we partitioned the CIFAR100 training dataset into new training and validation sets in an 80/20 ratio. After completing the hyperparameter search, we trained the model on the entire CIFAR100 training set and evaluated it on the complete test set. This study aims to determine effective hyperparameters for eight-layer models. The search space and optimal parameters were established after 50 optimization runs, with the best parameters achieving an accuracy of 61.85%:

  • $L_1$ activation penalty, optimal value $10^{-7}$

  • $L_2$ activation penalty, optimal value $10^{-6}$

  • $L_1$ weight decay, optimal value 0

  • Dropout rate before the output layer, optimal value 0.1456351951990277

  • Dropout/Noise Injection placement, optimal value: Full

  • “Dropout type” (use either Dropout or Noise Injection inside ConvKAGN layers), optimal value: Noise Injection

  • Width scale (a parameter that expands the number of channels), optimal value 6

  • Degree of Gram polynomials, optimal value 3

  • Adam weight decay, optimal value $6.579785489783022 \cdot 10^{-6}$

  • Learning rate, optimal value 0.000779538356958937

  • Learning rate power (a parameter that controls the learning rate scheduler), optimal value 1.1275350538654738

  • Label smoothing, optimal value 0.1823706816166831
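The 80/20 partition of the CIFAR100 training set used for the search can be sketched as a seeded shuffle of sample indices. The function name, the seed, and the absence of class stratification are assumptions; the paper does not state how the split was drawn.

```python
import random


def split_indices(num_samples, val_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed and split them into train/validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_val = int(round(num_samples * val_fraction))
    return indices[num_val:], indices[:num_val]


# The CIFAR100 training set contains 50,000 images.
train_idx, val_idx = split_indices(50_000)
print(len(train_idx), len(val_idx))
```

Keeping the test set out of the search entirely, and retraining on the full training set afterwards, is what prevents the reported test accuracy from being tuned against.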

The results for models trained with this parameter set are presented in Table 3 and Fig. 6.

Figure 6: Train and validation accuracy curves on CIFAR100, with hyperparameters discovered by hyperparameter optimization

| Val. Accuracy | Model | Parameters |
|---|---|---|
| 74.87 | 8 layers | 203.59M |
| 72.48 | 12 layers | 389.39M |
| 67.28 | 16 layers | 477.83M |

Table 3: Train and validation accuracy on CIFAR100, with hyperparameters discovered by hyperparameter optimization

The model trained with optimal parameters significantly outperforms the default configuration. Notably, increasing model depth adversely affects performance, which may be due to the extreme model size (nearly half a billion parameters) or issues related to vanishing gradients in very deep networks. However, several established methods, such as ResNet-like and DenseNet-like architectures, can mitigate these issues. The primary challenge remains managing the overwhelming number of parameters in these models. Addressing this challenge is essential for improving scalability and performance.

4.3 Bottleneck Kolmogorov-Arnold Convolutions on CIFAR100 and Tiny ImageNet

4.3.1 CIFAR100

In this section, we conduct a series of experiments on the CIFAR-100 dataset using Bottleneck KAGN convolutional layers. The training parameters are based on the hyperparameter optimization results discussed in Section 4.2.3, and all models were trained for 200 full epochs. We use the simple models with 8, 12, and 16 layers described in Section 4.2.2, a tiny DenseNet model as described in [21] with all convolutional layers replaced by Bottleneck KAGN convolutions, and ResNet-like models, also with convolutional layers replaced by Bottleneck KAGN convolutional or KAGN convolutional layers. This approach aims to evaluate the performance and scalability of the Bottleneck KAGN layers across different architectures. We also provide results for the Mixture of Experts versions of Bottleneck KAGN convolutional layers, described in Section 3.2, with 2 active experts out of 8.
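Activating 2 experts out of 8 can be illustrated with top-k softmax gating, a commonly used routing scheme. The sketch below is a stand-in for illustration only: the routing actually defined in Section 3.2 is not reproduced here, and the function names and toy experts are hypothetical.

```python
import math


def top_k_gate(logits, k=2):
    """Return {expert_index: weight} for the k largest logits, with weights
    given by a softmax restricted to the selected experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exp = {i: math.exp(logits[i] - peak) for i in top}
    norm = sum(exp.values())
    return {i: e / norm for i, e in exp.items()}


def moe_output(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts' outputs."""
    gates = top_k_gate(gate_logits, k)
    # Only the selected experts are evaluated.
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Eight toy experts; expert s multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(8)]
print(moe_output(2.0, experts, [0, 0, 5, 0, 0, 5, 0, 0]))
```

Because only the selected experts run for a given input, the layer holds the parameters of all eight experts while paying the compute cost of two.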

Tables 3 and 4 demonstrate that models using Bottleneck KAGN convolutions perform comparably to those using standard KAGN convolutions but with significantly fewer parameters. However, increasing the number of layers generally degrades model quality. The ResNet18 and ResNet34 variants illustrate that incorporating residual connections mitigates this issue, with ResNet34 outperforming ResNet18 for both KAGN and Bottleneck KAGN convolutions. This indicates that adopting ResNet-like and DenseNet-like approaches can effectively address the challenge of scaling models in depth.

| Model | Accuracy, % |
|---|---|
| Dspike ResNet-18 [22] | 74.24 |
| ELU ResNet [4] | 73.5 |
| OTTT [23] | 71.05 |
| WaveMix-Lite-256/7 [24] | 70.20 |
| IM-Loss (VGG-16) [25] | 70.18 |
| ResNet18 (modified) [26] | 66 |
| ResKAGNet34 (Ours) | 78.14 |
| ResKAGNet34 Bottleneck (Ours) | 77.11 |
| ResKAGNet34 Bottleneck MoE (Ours) | 71.66 |
| ResKAGNet18 (Ours) | 76.56 |
| ResKAGNet18 Bottleneck (Ours) | 76.43 |
| ResKAGNet18 Bottleneck MoE (Ours) | 71.64 |
| Tiny DenseKAGNet (Ours) | 77.77 |
| DenseKAGNet BottleNeck (Ours) | 76.9 |
| DenseKAGNet BottleNeck MoE (Ours) | 76.23 |
| 16 layers, BottleNeck MoE (Ours) | 69.53 |
| 16 layers, BottleNeck (Ours) | 70.65 |
| 12 layers, BottleNeck MoE (Ours) | 71.65 |
| 12 layers, BottleNeck (Ours) | 72.41 |
| 8 layers, BottleNeck MoE (Ours) | 74.39 |
| 8 layers, BottleNeck (Ours) | 74.83 |

Table 4: BottleneckKAGN Convolutional networks, CIFAR100
| Model | Accuracy, % |
|---|---|
| Tiny DenseNet [21] | 60.0 |
| ResNet-18 (AutoMix) [27] | 67.33 |
| ResNeXt-50 (AutoMix) [27] | 70.72 |
| Tiny DenseKAGNet (Ours) | 66.07 |
| Tiny DenseKAGNet, BottleNeck (Ours) | 65.9 |
| Tiny DenseKAGNet, BottleNeck MoE (Ours) | 64.82 |
| VGG19-like, MoE (Ours) | 63.72 |
| VGG19-like (Ours) | 48.59 |
| VGG16-like, MoE (Ours) | 64.27 |
| VGG16-like (Ours) | 52.55 |
| VGG13-like, MoE (Ours) | 64.57 |
| VGG13-like (Ours) | 52.57 |
| VGG11-like, MoE (Ours) | 63.22 |
| VGG11-like (Ours) | 51.22 |

Table 5: BottleneckKAGN Convolutional networks, Tiny ImageNet

4.3.2 Tiny ImageNet

In this section, we conduct a series of experiments on the Tiny ImageNet dataset. The training parameters are based on the hyperparameter optimization results discussed in Section 4.2.3, and all models were trained for 200 full epochs. We employ VGG-like models in which the hidden linear layers and the final MaxPool layer are replaced with Global Average Pooling and a single output layer. Additionally, we use a tiny DenseNet model, as described in [21], with all convolutional layers replaced by Bottleneck KAGN convolutions. We also perform experiments with Mixture of Experts (MoE) models featuring two active experts out of a total of eight.

The results, presented in Table 5, indicate that VGG-like models decrease in accuracy as the number of layers increases. In contrast, the MoE version shows significantly better performance on this dataset, suggesting that MoE layers can be an effective approach for scaling KAGN-based models in width.

In addition, the proposed models with Gram polynomial KAN convolutions, VGG-like MoE and Tiny DenseNet, outperform Tiny DenseNet [21] by a significant margin on the Tiny ImageNet dataset.

4.4 Self-KAGNtention

In this section, we empirically investigate the performance of BottleNeck SelfKAGNtention layers and BottleNeckKAGN Focal Modulation. We base our experiments on an architecture with eight convolutional layers, supplemented with three Self-Attention or Focal Modulation layers. The network (see Figs. 7 and 8) begins with two BottleNeckKAGN convolution layers, each with 32 filters (256 filters in the Fixed version) and a 3x3 kernel. This is followed by an attention layer. The next section consists of three BottleNeckKAGN convolution layers, each with 128 filters (256 filters in the Fixed version) and a 3x3 kernel, with the first layer using a stride of 2. Another attention layer is added after these layers. The subsequent section includes three BottleNeckKAGN convolution layers with 256 filters and a 3x3 kernel, with the first layer using a stride of 2. A third attention layer follows. The network concludes with a global max pooling layer and an output layer with 100 nodes.
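The layer sequence described above can be written down as a plain configuration list, which makes the difference between the regular and Fixed variants explicit. The spec format and the names `attention_model_spec`, `bottleneck_kagn_conv`, and `attention` are illustrative placeholders and not identifiers from the authors' code.

```python
def attention_model_spec(fixed=False, num_classes=100):
    """Build the layer list: three conv stages, each followed by attention."""
    first, second, third = (256, 256, 256) if fixed else (32, 128, 256)
    spec = []
    for stage, (filters, count) in enumerate([(first, 2), (second, 3), (third, 3)]):
        for layer in range(count):
            # The first convolution of the second and third stages downsamples.
            stride = 2 if (stage > 0 and layer == 0) else 1
            spec.append(("bottleneck_kagn_conv",
                         {"filters": filters, "kernel": 3, "stride": stride}))
        spec.append(("attention", {}))
    spec.append(("global_max_pool", {}))
    spec.append(("output", {"nodes": num_classes}))
    return spec


for name, config in attention_model_spec():
    print(name, config)
```

Calling `attention_model_spec(fixed=True)` yields the same sequence with 256 filters in every convolution.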

Figure 7: Model architecture with attention blocks. Either a SelfKAGNtention or a BottleNeckKAGN Focal Modulation layer can be used as the attention block.
Figure 8: Fixed model architecture with attention blocks. Either a SelfKAGNtention or a BottleNeckKAGN Focal Modulation layer can be used as the attention block. The main difference between this model and the one in Fig. 7 is the number of filters in each layer: the fixed version uses the same number of filters across all layers.

We use the training and regularization parameters found in Section 4.2.3, categorical cross-entropy as the loss function, and the AdamW optimizer [28]. The models were trained for 150 epochs with a batch size of 256. For Self-Attention layers, we employ bottleneck convolutions with 16 and 64 filters in the first two attention layers to reduce memory and computational resource usage.

We explore two model versions: one with an increasing number of filters with depth and another with a constant number of filters, inspired by transformer architectures. The experimental results on CIFAR-100, presented in Table 6, and compared with Table 3, show that the combination of attention mechanisms with bottleneck convolutions outperforms models without attention while maintaining a significantly lower number of parameters.

| Model | Width scale | Accuracy, CIFAR-100 | Parameters, M |
|---|---|---|---|
| SelfKAGNtention | 1 | 71.4 | 4.9 |
| Fixed SelfKAGNtention | 1 | 77.82 | 8.7 |
| FocalKAGNtention | 1 | 71.24 | 4.3 |
| Fixed FocalKAGNtention | 1 | 75.99 | 7.4 |
| SelfKAGNtention | 3 | 77.69 | 43.7 |
| Fixed SelfKAGNtention | 3 | 69.72 | 78.3 |
| FocalKAGNtention | 3 | 76.46 | 37.8 |
| Fixed FocalKAGNtention | 3 | 74.67 | 63.2 |

Table 6: BottleneckKAGN Convolutional with Attention layers, CIFAR-100 dataset

4.5 ImageNet1k

In this section, we provide results on the ImageNet1k [29] dataset. We tested several VGG-like [30] models with a number of modifications. We replaced the last MaxPooling layer and the two hidden fully connected layers with GlobalAveragePooling and a single fully connected output layer. We also added two extra convolutional layers at the end of the encoder.

The model consists of 10 consecutive Gram ConvKAN layers or Bottleneck Gram ConvKAN layers with BatchNorm and a polynomial degree of 5, followed by GlobalAveragePooling and a linear classification head (see Figs. 9 and 10). The network design starts with a KAGN convolution layer with 32 filters (3x3), followed by a max pooling layer (2x2). This pattern is repeated with a KAGN convolution layer with 64 filters (3x3) and another max pooling layer (2x2). The network then includes two consecutive KAGN convolution layers with 128 filters (3x3), followed by another max pooling layer (2x2). Next, there are two KAGN convolution layers with 256 filters (3x3), another max pooling layer (2x2), and two more KAGN convolution layers (3x3) with 256 filters for the V2 version or 512 filters for the V4 version. The network concludes with a global average pooling layer and a dense output layer with 1000 nodes.

We also tested a model with a Self KAN-attention layer, described in Section 3.3, placed before Global Average Pooling.

Figure 9: The scheme of the VGG11-like v2 model
Figure 10: The scheme of the VGG11-like v4 model. Note the difference between the V2 and V4 models: the last two convolutional layers of the V4 models are wider.

All models except one were trained for 200 full epochs with the AdamW optimizer, with a learning rate of 0.0009, $\beta_1 = 0.9$, $\beta_2 = 0.999$, a weight decay of $5 \cdot 10^{-6}$, 7500 warmup steps, a polynomial learning rate scheduler with power 0.3 and a final learning rate of $10^{-7}$, and a batch size of 32. The model marked with the "Opt params" tag was trained with the parameters discovered in Section 4.2.3 and uses a polynomial degree of 3. The models marked with "BottleNeck" use BottleNeck convolutional layers, described in Section 3.2.
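The learning rate schedule described above can be sketched as linear warmup followed by polynomial decay. The exact scheduler implementation is not given in the text, so the formula below (the common form `end + (base - end) * (1 - progress)^power`) and the `total_steps` argument are assumptions.

```python
def polynomial_lr(step, total_steps, base_lr=9e-4, end_lr=1e-7,
                  warmup_steps=7500, power=0.3):
    """Linear warmup to base_lr, then polynomial decay down to end_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return end_lr + (base_lr - end_lr) * remaining ** power


# `total_steps` is a hypothetical value chosen only for this demonstration.
for step in (0, 7500, 50_000, 100_000):
    print(step, polynomial_lr(step, total_steps=100_000))
```

With a power below 1, the rate stays close to its peak for most of training and drops sharply only near the end.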

The results of the experiments are presented in Table 7. Note that the metrics of our experiments are computed on the validation set of ImageNet1k. Metrics on the test set will be added later. According to these results, the VGG11-like model with bottleneck KAGN convolutions and optimized hyperparameters outperforms all models from the VGG family as well as the ResNet-18, ResNet-34, and DenseNet121 models. This shows the great potential of KAN convolution models.

| Model | Accuracy, Top-1 | Accuracy, Top-5 |
|---|---|---|
| VGG11 [30] | 69.02 | 88.628 |
| VGG11, w BatchNorm [30] | 70.37 | 89.81 |
| VGG13 [30] | 69.928 | 89.246 |
| VGG13, w BatchNorm [30] | 71.586 | 90.374 |
| VGG16 [30] | 71.592 | 90.382 |
| VGG16, w BatchNorm [30] | 73.36 | 91.516 |
| VGG19 [30] | 72.376 | 90.876 |
| VGG19, w BatchNorm [30] | 74.218 | 91.842 |
| ResNet18 [31] | 69.758 | 89.078 |
| ResNet34 [31] | 73.314 | 91.42 |
| ResNet50 [31] | 76.13 | 92.862 |
| DenseNet121 [5] | 74.434 | 91.972 |
| DenseNet161 [5] | 77.138 | 93.56 |
| VGG11-like, v2 (Ours)* | 59.1 | 82.29 |
| VGG11-like, v4 (Ours)* | 61.17 | 83.26 |
| VGG11-like, v4, BottleNeck (Ours)* | 68.50 | 88.46 |
| VGG11-like, v4, BottleNeck, SA (Ours)* | 70.684 | 89.462 |
| VGG11-like, v4, BottleNeck, Opt params (Ours)* | 74.586 | 92.13 |

* The weights are available on Hugging Face.

Table 7: Results on the ImageNet 1K dataset

4.6 Parameter Efficient Finetuning

In this section, we use the HAM10000 ("Human Against Machine with 10000 training images") dataset [32] to explore the proposed PEFT method for Gram KAN convolution models. The dataset consists of 10015 dermatoscopic images. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: Actinic keratoses and intraepithelial carcinoma / Bowen’s disease, basal cell carcinoma, benign keratosis-like lesions, dermatofibroma, melanoma, melanocytic nevi, and vascular lesions. We use the train, validation, and test split hosted on HuggingFace: https://huggingface.co/datasets/marmal88/skin_cancer.

We selected this dataset to evaluate the PEFT method due to its requirement for model adaptation to a novel domain not encountered during pretraining on ImageNet1k. This characteristic makes it an exemplary dataset for our experiments, providing a scenario closely aligned with real-world use cases.

We evaluate several setups with the same training procedure and parameters (see Table 8). We use the VGG11-like, v2 model pretrained on ImageNet1K. All models were trained for 20 full epochs with the AdamW optimizer, with a learning rate of 0.0001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, a weight decay of $5 \cdot 10^{-6}$, 500 warmup steps, a polynomial learning rate scheduler with power 0.3 and a final learning rate of $10^{-6}$, and a batch size of 12. Because of class imbalance in the dataset, we use Focal Loss [33] instead of categorical cross-entropy.
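For reference, the focal loss for a single sample is $-\alpha (1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the predicted probability of the true class. The sketch below implements this formula directly. The values of $\gamma$ and $\alpha$ used in the experiments are not stated in the text, so the defaults here ($\gamma = 2$, $\alpha = 1$) are assumptions.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


confident = focal_loss([0.05, 0.9, 0.05], target=1)
uncertain = focal_loss([0.4, 0.3, 0.3], target=1)
print(confident, uncertain)
```

The modulating factor $(1 - p_t)^{\gamma}$ shrinks the contribution of well-classified samples, so training focuses on hard examples, which often belong to under-represented classes. With $\gamma = 0$ the expression reduces to ordinary cross-entropy.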

Setups with the value "Random weights initialization" in the "Pretrained degree" column use randomly initialized weights with the same parameters as the VGG11-like, v2 model. The "Trainable Degree" column lists the trainable weights $\tilde{w}_g = \{w_i\}_{i=R}^{N+1}, R > 1$; a value of 6 means that an extra polynomial was added to the basis. "Full Training" means that all weights of the encoder were updated during training. The "Finetune activation residual" column indicates whether the weights $w_b$ of the residual activation were trainable.

As Table 8 shows, the PEFT setup in which the weights for the 4th-degree, 5th-degree, and extra polynomials are trainable achieves the highest accuracy among all setups, including full fine-tuning, which shows the potential of the proposed method. We also note that fine-tuning the activation residual weights generally leads to worse performance, so these weights can reasonably remain frozen during the PEFT procedure.

| Pretrained degree | Trainable Degree | Finetune activation residual | Accuracy | AUC (OvR) | AUC (OvO) | Recall (macro) | Recall (micro) | $F_1$ score (macro) | $F_1$ score (micro) |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 4, 5, 6 | FALSE | 83.89 | 0.975 | 0.984 | 0.842 | 0.839 | 0.752 | 0.839 |
| 5 | 4 | FALSE | 82.72 | 0.977 | 0.983 | 0.838 | 0.827 | 0.777 | 0.827 |
| 5 | 4, 5, 6 | TRUE | 81.79 | 0.971 | 0.983 | 0.827 | 0.818 | 0.714 | 0.818 |
| 5 | 4, 5 | FALSE | 80.7 | 0.977 | 0.985 | 0.827 | 0.807 | 0.75 | 0.807 |
| 5 | 5 | FALSE | 80.16 | 0.974 | 0.98 | 0.802 | 0.802 | 0.758 | 0.802 |
| 5 | 4, 5 | TRUE | 77.28 | 0.966 | 0.974 | 0.769 | 0.773 | 0.68 | 0.773 |
| 5 | Full Training | TRUE | 76.89 | 0.976 | 0.985 | 0.842 | 0.769 | 0.729 | 0.769 |
| 5 | 5 | TRUE | 75.8 | 0.963 | 0.972 | 0.783 | 0.758 | 0.696 | 0.758 |
| 5 | 6 | FALSE | 74.32 | 0.941 | 0.94 | 0.634 | 0.743 | 0.636 | 0.743 |
| 5 | 3 | FALSE | 74.16 | 0.945 | 0.939 | 0.573 | 0.742 | 0.601 | 0.742 |
| 5 | 4 | TRUE | 73.07 | 0.967 | 0.981 | 0.84 | 0.731 | 0.69 | 0.731 |
| Random weights initialization | Full Training | TRUE | 71.67 | 0.958 | 0.976 | 0.837 | 0.717 | 0.67 | 0.717 |
| 5 | 2 | FALSE | 71.28 | 0.927 | 0.924 | 0.507 | 0.713 | 0.529 | 0.713 |
| 5 | 6 | TRUE | 69.18 | 0.938 | 0.947 | 0.69 | 0.692 | 0.639 | 0.692 |
| 5 | 5, 6 | FALSE | 67.78 | 0.936 | 0.938 | 0.55 | 0.678 | 0.54 | 0.678 |
| 5 | 2 | TRUE | 66.69 | 0.928 | 0.939 | 0.634 | 0.667 | 0.599 | 0.667 |
| 5 | 5, 6 | TRUE | 66.69 | 0.939 | 0.946 | 0.577 | 0.667 | 0.564 | 0.667 |
| 5 | 3 | TRUE | 58.13 | 0.944 | 0.955 | 0.678 | 0.581 | 0.622 | 0.581 |
| 5 | Fixed encoder | FALSE | 48.25 | 0.864 | 0.87 | 0.53 | 0.482 | 0.361 | 0.482 |

Table 8: PEFT results on HAM10000 dataset. The first three columns describe the training setup; the remaining columns report metrics.

4.7 Segmentation

In this section, we provide empirical evaluation for U-Net-like segmentation models based on KAGN Convolutional layers on BUSI, GlaS, and CVC-ClinicDB datasets.

The BUSI dataset [34] consists of ultrasound images depicting normal, benign, and malignant breast cancer cases, along with their corresponding segmentation maps. In our study, we utilized 647 ultrasound images representing both benign and malignant breast tumors, all consistently resized to 256 × 256 pixels. This dataset provides a comprehensive collection of images that assist in detecting and differentiating various types of breast tumors, offering valuable insights for medical professionals and researchers.

The GlaS dataset [35] includes 612 Standard Definition (SD) frames from 31 sequences, each with a resolution of 384 × 288 pixels, collected from 23 patients. This dataset is associated with the Hospital Clinic in Barcelona, Spain. The sequences were recorded using Olympus Q160AL and Q165L devices, paired with an Extra II video processor. For our study, we specifically used 165 images from the GlaS dataset, all resized to 512 × 512 pixels.

The CVC-ClinicDB dataset [35], also known as "CVC," is a publicly accessible resource for polyp diagnosis within colonoscopy videos. It comprises 612 images, each with a resolution of 384 × 288 pixels, extracted from 31 distinct colonoscopy sequences. These frames offer a diverse array of polyp instances, making them particularly useful for developing and evaluating polyp detection algorithms. To ensure consistency across different datasets in our study, all images from the CVC-ClinicDB dataset were uniformly resized to 256 × 256 pixels.

We explore three different U-Net-like models: a U-Net with convolutions replaced by KAGN convolutional layers; a $U^2$-Net-like model [12], again with convolutions replaced by KAGN convolutional layers; and a small $U^2$-Net model, in which the number of filters is the same across all hidden layers and equals 16 multiplied by the width scale parameter.

For the BUSI, GlaS, and CVC datasets, the batch size was set to 4 and the learning rate to 1e-4; all other parameters were the same as discussed in Section 4.2.3. The loss function was a combination of binary cross-entropy and Dice loss. We randomly split each dataset into 80% training and 20% validation subsets.
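A combined binary cross-entropy and Dice loss can be sketched as below for flattened probability maps. The equal weighting of the two terms and the smoothing constant are assumptions, since the text does not specify how the terms are combined.

```python
import math


def bce_dice_loss(probs, targets, bce_weight=0.5, eps=1e-7):
    """Weighted sum of binary cross-entropy and soft Dice loss over flat lists."""
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithm finite
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= len(probs)
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return bce_weight * bce + (1.0 - bce_weight) * (1.0 - dice)


targets = [1, 0, 1, 0]
print(bce_dice_loss([0.9, 0.1, 0.8, 0.2], targets))
print(bce_dice_loss([0.4, 0.6, 0.3, 0.7], targets))
```

The cross-entropy term gives per-pixel gradients, while the Dice term measures overlap of the whole mask and is less sensitive to the imbalance between foreground and background pixels.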

The results are presented in Table 9. On each of the three datasets, the best of our models outperforms all compared state-of-the-art methods, including the novel U-KAN model [13]. Qualitative results of the models are provided in Fig. 11.

Figure 11: Examples of model predictions on three diverse medical image datasets
| Methods | BUSI [34]: IoU↑ | BUSI [34]: F1↑ | GlaS [35]: IoU↑ | GlaS [35]: F1↑ | CVC [36]: IoU↑ | CVC [36]: F1↑ |
|---|---|---|---|---|---|---|
| U-Net [6] | 57.22 | 71.91 | 86.66 | 92.79 | 83.79 | 91.06 |
| Att-Unet [37] | 55.18 | 70.22 | 86.84 | 92.89 | 84.52 | 91.46 |
| U-Net++ [10] | 57.41 | 72.11 | 87.07 | 92.96 | 84.61 | 91.53 |
| U-NeXt [38] | 59.06 | 73.08 | 84.51 | 91.55 | 74.83 | 85.36 |
| Rolling-UNet [39] | 61.00 | 74.67 | 86.42 | 92.63 | 82.87 | 90.48 |
| U-Mamba [40] | 61.81 | 75.55 | 87.01 | 93.0 | 84.79 | 91.63 |
| U-KAN [13] | 63.38 | 76.40 | 87.64 | 93.37 | 85.05 | 91.88 |
| $U^2$-KAGNet, Small (Ours) | 59.44 | 74.56 | 89.13 | 94.25 | 85.62 | 92.26 |
| $U^2$-KAGNet, BottleNeck, Small (Ours) | 62.95 | 77.26 | 85.99 | 92.47 | 87.61 | 93.39 |
| $U^2$-KAGNet, BottleNeck (Ours) | 55.91 | 71.72 | 87.73 | 93.47 | 88.86 | 94.1 |
| $U^2$-KAGNet (Ours) | 58.13 | 73.52 | 88.14 | 93.7 | 87.07 | 93.09 |
| UKAGNet (Ours) | 63.45 | 77.64 | 87.31 | 93.23 | 76.85 | 86.91 |

Table 9: Comparison with state-of-the-art segmentation models on three heterogeneous medical datasets.
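The two metrics reported in Table 9 can be computed from binary masks as in the sketch below. How the scores are aggregated (per image or over the whole validation set) is not stated in the text, so the function handles a single pair of flattened masks. The convention for two empty masks is an assumption.

```python
def iou_and_f1(pred, target):
    """Return (IoU, F1) for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    union = tp + fp + fn
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement
    return tp / union, 2 * tp / (2 * tp + fp + fn)


iou, f1 = iou_and_f1([1, 1, 0, 0], [1, 0, 1, 0])
print(iou, f1)
```

For a single mask pair the two metrics are linked by F1 = 2·IoU / (1 + IoU), which is why F1 is always at least as large as IoU.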

5 Ablation Study

In this section, we present the results of an ablation study of the Bottleneck Kolmogorov-Arnold convolutional layers with Gram polynomials. The baseline sequence for convolutional KAN models includes an activation residual summed with a nonlinearity $\tilde{\varphi}$, followed by a normalization layer (e.g., batch norm, instance norm) and a SiLU activation. The ablation experiments involve excluding one or more elements from this sequence (activation residual, normalization layer, nonlinearity) and replacing the linear bottleneck with a KAN-based bottleneck layer. Experiments were conducted on the MNIST [18], CIFAR-10, CIFAR-100, and Fashion MNIST [41] datasets, and results are shown in Table 10.

The results show that, except for MNIST, the KAN-based bottleneck leads to training collapse or significant accuracy degradation, and that using the activation residual degrades performance in roughly half of the cases (five of the eight dataset and depth combinations when batch norm and activation are kept). This suggests that the activation residual may be redundant in some cases and warrants further investigation. Nevertheless, we propose retaining the convolutional layer scheme with the activation residual, as the bottleneck approach can lead to information loss, and the activation residual can help recover it.

| Model | Setup | Accuracy, MNIST | Accuracy, Fashion-MNIST | Accuracy, CIFAR-10 | Accuracy, CIFAR-100 |
|---|---|---|---|---|---|
| 8 layers | activation skip, linear bottleneck | 0.1135 | 0.8948 | 0.7392 | 0.01 |
| 8 layers | activation skip, linear bottleneck, activation | 0.1135 | 0.1 | 0.1 | 0.01 |
| 8 layers | activation skip, KAN bottleneck, batch norm, activation | 0.1135 | 0.1 | 0.1 | 0.01 |
| 8 layers | linear bottleneck, batch norm, activation | 0.9967 | 0.9424 | 0.8999 | 0.6672 |
| 8 layers | activation skip, linear bottleneck, batch norm, activation | 0.9969 | 0.9348 | 0.8923 | 0.6489 |
| 4 layers | activation skip, linear bottleneck | 0.9552 | 0.815 | 0.5438 | 0.2614 |
| 4 layers | activation skip, linear bottleneck, activation | 0.9858 | 0.83 | 0.626 | 0.2801 |
| 4 layers | activation skip, KAN bottleneck, batch norm, activation | 0.9949 | 0.8879 | 0.7447 | 0.4417 |
| 4 layers | linear bottleneck, batch norm, activation | 0.9935 | 0.9108 | 0.8021 | 0.4921 |
| 4 layers | activation skip, linear bottleneck, batch norm, activation | 0.9941 | 0.9016 | 0.7962 | 0.4951 |

Table 10: Ablation study results
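
The effect of the activation residual can be read off Table 10 by pairing the two setups that differ only in the presence of the activation skip (linear bottleneck, batch norm, and activation kept in both). The short sketch below performs this comparison on values copied from the table; the names `TABLE_10` and `compare_activation_skip` are introduced here for illustration only.

```python
# Accuracies copied from Table 10, ordered as in DATASETS.
DATASETS = ("MNIST", "Fashion-MNIST", "CIFAR-10", "CIFAR-100")
TABLE_10 = {
    8: {"without_skip": (0.9967, 0.9424, 0.8999, 0.6672),
        "with_skip": (0.9969, 0.9348, 0.8923, 0.6489)},
    4: {"without_skip": (0.9935, 0.9108, 0.8021, 0.4921),
        "with_skip": (0.9941, 0.9016, 0.7962, 0.4951)},
}


def compare_activation_skip(table):
    """Return (layers, dataset, delta) triples, delta = with_skip - without_skip."""
    rows = []
    for layers, setups in table.items():
        pairs = zip(DATASETS, setups["with_skip"], setups["without_skip"])
        for name, with_skip, without_skip in pairs:
            rows.append((layers, name, round(with_skip - without_skip, 4)))
    return rows


deltas = compare_activation_skip(TABLE_10)
degraded = [row for row in deltas if row[2] < 0]
for layers, name, delta in deltas:
    print(f"{layers} layers, {name}: {delta:+.4f}")
print(f"activation skip lowers accuracy in {len(degraded)} of {len(deltas)} comparisons")
```

By this count, the activation skip lowers accuracy in five of the eight paired comparisons, with gains on MNIST at both depths and on CIFAR-100 for the 4-layer model.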

6 Design Principles

Summarizing the experiments conducted, we propose the following preliminary design principles for Kolmogorov-Arnold convolutional networks. It is important to note that our experiments were not exhaustive, and these principles may be revised with new data.

  • We recommend using Gram polynomials for $\tilde{\varphi}(x)$. Our experiments indicate this choice excels in both quality metrics and the number of trainable parameters.

  • For scaling Kolmogorov-Arnold convolution-based models, we suggest using the bottleneck version of the layers, which significantly reduces the number of trainable parameters without substantial loss in performance compared to the non-bottleneck version.

  • Increasing model width generally performs better than increasing depth, as shown by our experiments with simple sequential models. The Mixture of Experts version of the bottleneck convolution effectively scales model width without a significant increase in inference and training costs.

  • Preliminary findings suggest that DenseNet-like architectures could serve as a strong foundation for constructing very deep Kolmogorov-Arnold convolutional networks.

  • Our experiments demonstrate that Self KAGNtention layers can enhance the performance of Kolmogorov-Arnold convolutional models.

  • Kolmogorov-Arnold convolutions perform exceptionally well in segmentation tasks, with $U^{2}$-Net [12] recommended as a starting architecture for further research.

  • As shown in Section 4.2, $L_{1}$ and $L_{2}$ activation regularization, and Noise Injection layers before the Kolmogorov-Arnold convolutional layer are effective regularization techniques.
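
The regularization techniques named in the last principle can be sketched as follows. This is an illustration under stated assumptions rather than the formulation of Section 4.2: the class `NoiseInjection`, its parameters `p` and `alpha`, the masked Gaussian noise scheme, and the helper `activation_penalty` are all hypothetical names and choices.

```python
import torch
import torch.nn as nn


class NoiseInjection(nn.Module):
    """Hypothetical layer: perturbs a random subset of activations in training mode."""

    def __init__(self, p=0.05, alpha=0.05):
        super().__init__()
        self.p = p          # assumed: fraction of positions that receive noise
        self.alpha = alpha  # assumed: noise scale relative to the input std

    def forward(self, x):
        if not self.training:
            return x  # identity at inference time
        mask = (torch.rand_like(x) < self.p).to(x.dtype)
        noise = torch.randn_like(x) * x.detach().std() * self.alpha
        return x + mask * noise


def activation_penalty(activations, l1=0.0, l2=0.0):
    """Weighted sum of mean |a| (L1) and mean a^2 (L2) over a list of tensors."""
    total = activations[0].new_zeros(())
    for a in activations:
        total = total + l1 * a.abs().mean() + l2 * a.pow(2).mean()
    return total


layer = NoiseInjection(p=0.1, alpha=0.05)
layer.train()
x = torch.randn(2, 4, 8, 8)
noisy = layer(x)  # would be placed before a Kolmogorov-Arnold convolution
penalty = activation_penalty([noisy], l1=1e-4, l2=1e-4)
print(tuple(noisy.shape), float(penalty) >= 0.0)
```

In a training loop, the penalty would be added to the task loss, with the activation tensors collected from the layers to be regularized.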

7 Conclusion

This paper explores the integration of Kolmogorov-Arnold Networks (KANs) into convolutional neural network architectures, presenting novel approaches and modifications to enhance their performance and efficiency in computer vision tasks. Our work introduces Bottleneck Convolutional Kolmogorov-Arnold layers, a parameter-efficient design that reduces memory requirements and mitigates overfitting issues. Additionally, we propose a parameter-efficient fine-tuning algorithm that significantly decreases the number of trainable parameters needed for adapting pre-trained models to new tasks.

Through extensive empirical evaluations on various datasets, including MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, HAM10000, BUSI, GlaS, and CVC-ClinicDB, we demonstrate that KAN-based convolutional models can achieve state-of-the-art results in both classification and segmentation tasks. Our experiments highlight the effectiveness of Gram polynomials as the basis function for KANs, the advantages of scaling model width over depth, and the potential of DenseNet-like architectures for very deep networks.

We further show that incorporating Self KAGNtention layers enhances model performance, particularly in complex tasks, and provide design principles for constructing successful KAN convolutional models. Our proposed models not only outperform traditional convolutional networks but also offer a promising direction for future research in optimizing neural network architectures for computer vision applications.

Overall, our findings emphasize the potential of Kolmogorov-Arnold Networks in advancing the capabilities of convolutional neural networks, paving the way for more efficient and effective deep learning models. Future work will focus on refining these approaches and exploring their applications in other domains, as well as investigating additional regularization techniques and optimization strategies to further enhance model performance.

References

  • [1] Yann LeCun, Koray Kavukcuoglu, and Clement Farabet. Convolutional networks and applications in vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 253–256, 2010.
  • [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
  • [3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • [4] Anish Shah, Eashan Kadam, Hena Shah, Sameer Shinde, and Sandip Shingade. Deep residual networks with exponential linear unit. In Proceedings of the Third International Symposium on Computer Vision and the Internet, VisionNet’16, page 59–65, New York, NY, USA, 2016. Association for Computing Machinery.
  • [5] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
  • [6] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
  • [7] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark. Kan: Kolmogorov-arnold networks, 2024.
  • [8] Subhransu S. Bhattacharjee. Torchkan: Simplified kan model with variations. https://github.com/1ssb/torchkan/, 2024.
  • [9] Alexander Dylan Bodner, Antonio Santiago Tepsich, Jack Natan Spolski, and Santiago Pourteau. Convolutional kolmogorov-arnold networks, 2024.
  • [10] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pages 3–11. Springer, Cham, 2018.
  • [11] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation, 2021.
  • [12] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition, 106:107404, 2020.
  • [13] Chenxin Li, Xinyu Liu, Wuyang Li, Cheng Wang, Hengyu Liu, and Yixuan Yuan. U-kan makes strong backbone for medical image segmentation and generation, 2024.
  • [14] Tawan. Gram: Kan meets gram polynomials. https://github.com/Khochawongwat/GRAMKAN, 2024.
  • [15] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.
  • [16] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 22–31, 2021.
  • [17] Jianwei Yang, Chunyuan Li, Xiyang Dai, and Jianfeng Gao. Focal modulation networks. Advances in Neural Information Processing Systems, 35:4203–4217, 2022.
  • [18] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • [19] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • [20] Ya Le and Xuan S. Yang. Tiny imagenet visual recognition challenge. 2015.
  • [21] Zoheb Abai and Nishad Rajmalwar. Densenet models for tiny imagenet classification. ArXiv, abs/1904.10429, 2019.
  • [22] Yuhang Li, Yufei Guo, Shanghang Zhang, Shikuang Deng, Yongqing Hai, and Shi Gu. Differentiable spike: Rethinking gradient-descent for training spiking neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 23426–23439. Curran Associates, Inc., 2021.
  • [23] Mingqing Xiao, Qingyan Meng, Zongpeng Zhang, Di He, and Zhouchen Lin. Online training through time for spiking neural networks. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [24] Pranav Jeevan, Kavitha Viswanathan, Anandu A S, and Amit Sethi. Wavemix: A resource-efficient neural network for image analysis, 2024.
  • [25] Yufei Guo, Yuanpei Chen, Liwen Zhang, Xiaode Liu, Yinglei Wang, Xuhui Huang, and Zhe Ma. Im-loss: Information maximization loss for spiking neural networks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 156–166. Curran Associates, Inc., 2022.
  • [26] Riad Ibadulla, Thomas M. Chen, and Constantino Carlos Reyes-Aldasoro. Fatnet: High-resolution kernels for classification using fully convolutional optical neural networks. AI, 4(2):361–374, 2023.
  • [27] Zicheng Liu, Siyuan Li, Di Wu, Zihan Liu, Zhiyuan Chen, Lirong Wu, and Stan Z. Li. Automix: Unveiling the power of mixup for stronger classifiers. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 441–458, Cham, 2022. Springer Nature Switzerland.
  • [28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.
  • [29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
  • [30] K Simonyan and A Zisserman. Very deep convolutional networks for large-scale image recognition. pages 1–14. Computational and Biological Learning Society, 2015.
  • [31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [32] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1), August 2018.
  • [33] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):318–327, 2020.
  • [34] Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in brief, 28:104863, 2020.
  • [35] Jeya Maria Jose Valanarasu, Poojan Oza, Ilker Hacihaliloglu, and Vishal M Patel. Medical transformer: Gated axial-attention for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 36–46. Springer, 2021.
  • [36] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized medical imaging and graphics, 43:99–111, 2015.
  • [37] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
  • [38] Jeya Maria Jose Valanarasu and Vishal M Patel. Unext: Mlp-based rapid medical image segmentation network. In International conference on medical image computing and computer-assisted intervention, pages 23–33. Springer, 2022.
  • [39] Yutong Liu, Haijiang Zhu, Mengting Liu, Huaiyuan Yu, Zihan Chen, and Jie Gao. Rolling-unet: Revitalizing mlp’s ability to efficiently extract long-distance dependencies for medical image segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 3819–3827, 2024.
  • [40] Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
  • [41] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.