Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies

Ivan Drokin
Deep Learning Researcher
Seath the Scaleless Research Group

Abstract

The emergence of Kolmogorov-Arnold Networks (KANs) has sparked significant interest and debate within the scientific community. This paper explores the application of KANs in the domain of computer vision (CV). We examine the convolutional version of KANs, considering various nonlinearity options beyond splines, such as Wavelet transforms and a range of polynomials. We propose a parameter-efficient design for Kolmogorov-Arnold convolutional layers and a parameter-efficient finetuning algorithm for pre-trained KAN models, as well as KAN convolutional versions of self-attention and focal modulation layers. We provide empirical evaluations conducted on MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, and HAM10000 datasets for image classification tasks. Additionally, we explore segmentation tasks, proposing U-Net-like architectures with KAN convolutions, and achieving state-of-the-art results on BUSI, GlaS, and CVC datasets. We summarize all of our findings in a preliminary design guide for KAN convolutional models in computer vision tasks. Furthermore, we investigate regularization techniques for KANs. All experimental code and implementations of the convolutional layers and models, including weights pre-trained on ImageNet1k, are available on GitHub: https://github.com/IvanDrokin/torch-conv-kan.

Keywords Convolutional Kolmogorov-Arnold Networks $\cdot$ Kolmogorov-Arnold Networks $\cdot$ Computer Vision

1 Introduction

The rapid evolution of deep learning architectures has significantly advanced the field of computer vision, particularly in tasks that require the analysis of complex spatial data. Convolutional Neural Networks (CNNs), initially proposed by LeCun et al. [1], have become a cornerstone in this domain due to their ability to efficiently process high-dimensional data arrays such as images. These networks typically employ linear transformations followed by activation functions in their convolutional layers to discern spatial relationships, thereby reducing the number of parameters needed to capture intricate patterns in visual data. Since 2012, following the success of AlexNet [2] in the ImageNet classification challenge, CNNs have dominated the field of computer vision until the emergence of Vision Transformers [3]. Innovations such as Residual Networks [4] and Densely Connected networks [5], along with numerous subsequent works, have significantly advanced the achievable quality of models based on convolutional layers, enabling the effective training of very large and deep networks.

In segmentation tasks, especially within the biomedical domain, CNNs have also become foundational with the advent of the U-Net [6] architecture, which has subsequently inspired a whole family of U-Net-like architectures for segmentation tasks.

Recent developments in deep learning have seen the integration of sophisticated mathematical theories into neural network architectures, enhancing their capability to handle complex data structures. One such innovation is the Kolmogorov-Arnold Network (KAN) [7], which leverages the Kolmogorov-Arnold theorem to incorporate splines into its architecture, offering a compelling alternative to traditional Multi-Layer Perceptrons (MLPs). Shortly after the original work introducing KANs, modifications and improvements emerged that attempt to overcome various issues associated with the spline-based approach, namely computational overhead and a large number of trainable parameters.

In light of these advancements, this paper explores the adaptation of KANs to convolutional layers—a prevalent component in many CNN architectures used for image classification. Traditional CNNs, while effective, often rely on fixed activation functions and linear transformations, which can be enhanced through the flexibility and reduced parametric complexity offered by KANs. In this work, we explore various modifications of the original KAN model, propose efficient designs for convolutional KAN models, conduct empirical evaluations of the proposed approaches, and compare them with classical convolutional networks.

Our contributions can be summarized as follows:

1. We present the Bottleneck Convolutional Kolmogorov-Arnold layer, which retains the properties of Kolmogorov-Arnold layers while significantly reducing memory requirements.
2. We empirically investigate several modifications of convolutional KAN models in the context of image classification tasks, including studies on regularization and hyperparameter tuning, across multiple datasets.
3. We introduce a parameter-efficient finetuning algorithm for Gram polynomials in Convolutional Kolmogorov-Arnold Networks, which substantially reduces the number of trainable parameters needed when adapting a pre-trained model to new tasks.
4. We redesign U-Net-like models for segmentation tasks with Kolmogorov-Arnold convolutional layers instead of regular convolutions, demonstrating that such models achieve state-of-the-art results on three diverse biomedical datasets.
5. We propose redesigns of Self-Attention and Focal Modulation layers based on Bottleneck Convolutional Kolmogorov-Arnold layers, which significantly improve the performance of classification models.
6. Based on the results of our empirical studies, we formulate design principles for constructing successful computer vision models based on Bottleneck Convolutional Kolmogorov-Arnold layers.
7. We provide the entire codebase on GitHub, and pre-trained weights on ImageNet are available on HuggingFace to accelerate research and ensure reproducibility.

Our work is structured as follows. In Section 2, we present an overview of relevant works on Kolmogorov-Arnold networks. In Section 3, we describe the Kolmogorov-Arnold convolutional layer, its bottleneck version, attention and focal modulation layers, regularization techniques for the Kolmogorov-Arnold convolutional layer, and introduce and describe the PEFT algorithm for polynomial variants of Kolmogorov-Arnold convolutional networks. In Section 4, we present experiments on image classification and segmentation tasks. In Section 5, we conduct an ablation study for the bottleneck Kolmogorov-Arnold convolutional layer. In Section 6, we summarize all experimental results into a design guide for constructing Kolmogorov-Arnold convolutional networks.

2 Related works

The application of the Kolmogorov-Arnold theorem in neural networks marks a significant theoretical integration that enhances the expressiveness and efficiency of neural models. The theorem, which provides a way to represent any multivariate continuous function as a composition of univariate functions and additions, has been adapted in the design of Kolmogorov-Arnold Networks (KANs). KANs differ from traditional Multi-Layer Perceptrons (MLPs) by replacing linear weight matrices with learnable splines, thus reducing the number of parameters required and potentially improving the generalization capabilities of the network.

Recent research has proposed several variations of KANs to address specific limitations and enhance their performance.

Fast KAN by Li et al. (2024) introduced an adaptation where B-splines are replaced by Radial Basis Functions (RBFs). This modification aims to reduce the computational overhead associated with splines. The work demonstrated that third-order B-splines used in traditional KANs could be effectively approximated by Gaussian radial basis functions, resulting in FastKAN—a faster implementation of KAN which also functions as an RBF network.

Wavelet-based KANs (Wav-KAN), as presented by Bozorgasl et al. (2024), incorporate wavelet functions into the KAN structure to enhance both interpretability and performance. Wav-KAN leverages the properties of wavelets to efficiently capture high-frequency and low-frequency components of input data, balancing accurate representation of data structures with robustness against overfitting. The implementation employs discrete wavelet transforms (DWT) for multiresolution analysis, which simplifies the computation process. Wav-KAN has demonstrated enhanced accuracy, faster training speeds, and increased robustness compared to traditional Spl-KAN and MLPs.

ReLU-KAN by Qiu et al. (2024) addresses the computational complexity of basis function calculations in traditional KANs by introducing a novel implementation that utilizes ReLU (Rectified Linear Unit) and point-wise multiplication. This approach optimizes the computation process for efficient CUDA computing, achieving a significant speedup (20x) over traditional KANs while maintaining stable training and superior fitting ability. ReLU-KAN preserves the "catastrophic forgetting avoidance" property of KANs, making it a practical choice for both inference and training within existing deep learning frameworks.

Polynomial-based variations of KANs have also been explored. Chebyshev KAN replaces B-splines with Chebyshev polynomials, which are known for their excellent function approximation capabilities and can be calculated recursively. This approach aims to improve the performance and intuitiveness of KANs. Meanwhile, Gram KAN leverages the simplicity of Gram polynomial transformations, characterized by their discrete nature. This discrete approach is particularly suited for handling discretized datasets like images and text data, offering a novel method for data processing in neural networks.

Kolmogorov Arnold Legendre Network (KAL-Net) [8] represents a novel architecture using Legendre polynomials to surpass traditional polynomial approximations like splines in KANs. KAL-Net utilizes Legendre polynomials up to a specific order for input normalization, capturing nonlinear relationships more efficiently. By leveraging caching mechanisms and recurrence relations, KAL-Net enhances computational efficiency. It employs SiLU (Sigmoid Linear Unit) activation functions and layer normalization to stabilize outputs and improve training stability. KAL-Net has demonstrated remarkable accuracy (97.8%) on the MNIST dataset and efficiency, with an average forward pass taking only 500 microseconds, highlighting its potential in handling complex image patterns and computational efficiency.

Convolutional Kolmogorov-Arnold Networks (Convolutional KANs) [9] were introduced as an alternative to standard Convolutional Neural Networks (CNNs) in a recent study. This approach integrates the non-linear activation functions from KANs into convolutional layers, creating a new layer type that maintains similar accuracy levels to traditional CNNs while using half the number of parameters. This significant reduction in parameters highlights a promising direction for optimizing neural network architectures.

In the domain of medical image segmentation, U-Net [6] has become a foundational architecture, utilizing an encoder-decoder structure to effectively capture image features. Variations such as U-Net++ [10] improve segmentation accuracy through nested structures that fuse multi-scale features. Beyond convolution-based methods, transformer-based models like Vision Transformer [3] and TransUNet [11] have shown effectiveness in incorporating global context into medical image segmentation.

$U^{2}$ -Net [12], designed for salient object detection, employs a two-level nested U-structure to capture contextual information across different scales. This architecture increases depth without significantly increasing computational cost, offering competitive performance on various datasets.

Li et al. [13] investigate and redesign the established U-Net pipeline by integrating Kolmogorov-Arnold Network (KAN) layers into the intermediate tokenized representations, resulting in the U-KAN architecture.

3 Method

This section details the methods used in our study, focusing on Kolmogorov-Arnold convolutions and their various adaptations and enhancements. The subsections cover the following topics:

In Section 3.1, we provide a brief overview of Kolmogorov-Arnold convolutions, as initially presented in [9], including their formalization and various basis function options such as splines, Radial-Basis Functions, Wavelets, and polynomials. We also introduce the use of Gram polynomials for parameter-efficient fine-tuning.

Section 3.2 addresses the primary issue with Kolmogorov-Arnold Convolutions—the high number of parameters introduced by the basis functions. We propose a bottleneck version to mitigate this problem, involving a squeezing convolution before and an expanding convolution after applying the basis function. This design includes a mixture of experts for effective implementation.

In Section 3.3, we describe the construction of Self-KAGtention layers by substituting traditional convolutions with Kolmogorov-Arnold convolutional layers. Additionally, we introduce Focal KAGN Modulation, where all convolutional layers in the original focal modulation are replaced with Kolmogorov-Arnold convolutional layers.

Section 3.4 discusses various regularization techniques, including weight and activation penalties, dropout placements, and additive Gaussian noise injection. We explore the impact of these techniques on Kolmogorov-Arnold convolutional networks.

Finally, in Section 3.5, we introduce the parameter-efficient finetuning algorithm for polynomial variants of Kolmogorov-Arnold convolutional networks. We outline several options for parameter-efficient fine-tuning, aiming to reduce the number of trainable parameters while adapting pre-trained models to new tasks.

3.1 Kolmogorov-Arnold Convolutions

Kolmogorov-Arnold convolutions were presented in [9]. In this section, we briefly cover their formalization.

Kolmogorov-Arnold convolutions can be stated as follows: the kernel consists of a set of univariate non-linear functions. Suppose we have an input image $y\in\mathbb{R}^{c\times n\times m}$, where $c$ is the number of channels, and $n,m$ are the height and width of the image, respectively. Then, a KAN-based convolution with kernel size $k$ can be defined as:

$\displaystyle x_{ij}=\sum_{d=1}^{c}\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}\varphi_{a,b,d}(y_{d,i+a,j+b}),\quad i=\overline{1,n-k+1},\; j=\overline{1,m-k+1}$

Each $\varphi$ is a univariate non-linear learnable function with its own trainable parameters. In the original paper [7], the authors propose to use the following form of the functions:

	$\displaystyle\varphi=w_{b}\cdot b(x)+\tilde{\varphi}(x),$
	$\displaystyle\tilde{\varphi}(x)=w_{s}\cdot Spline(x)$
	$\displaystyle b(x)=SiLU(x)=x/(1+e^{-x})$
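The definition above can be illustrated with a minimal pure-Python sketch for a single output channel. This is an illustration under stated assumptions rather than the implementation from the repository: the names `kan_conv2d` and `make_phi` are hypothetical, tensors are nested lists, and the basis passed to `make_phi` is an arbitrary callable standing in for the spline.

```python
import math


def silu(x):
    """Residual activation b(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


def make_phi(w_b, w_s, basis):
    """Build phi(x) = w_b * b(x) + w_s * basis(x).

    `basis` is a stand-in for Spline(x); any univariate callable works here.
    """
    return lambda x: w_b * silu(x) + w_s * basis(x)


def kan_conv2d(y, phi):
    """KAN convolution without padding, stride 1, one output channel.

    y   : nested list of shape [c][n][m] (channels, height, width)
    phi : nested list of shape [c][k][k] of univariate callables;
          phi[d][a][b] plays the role of varphi_{a,b,d}
    """
    c, n, m = len(y), len(y[0]), len(y[0][0])
    k = len(phi[0])
    out = [[0.0] * (m - k + 1) for _ in range(n - k + 1)]
    for i in range(n - k + 1):
        for j in range(m - k + 1):
            # Sum of univariate functions applied to every pixel in the window.
            out[i][j] = sum(
                phi[d][a][b](y[d][i + a][j + b])
                for d in range(c)
                for a in range(k)
                for b in range(k)
            )
    return out
```

When every `phi[d][a][b]` is a plain scaling `x -> w * x`, the sketch reduces to an ordinary linear convolution, which is a convenient sanity check.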

Similar to KANs, basis functions other than splines can be chosen for $\tilde{\varphi}(x)$: Radial Basis Functions, wavelets, polynomials, etc.

Replacing splines with Gram polynomials was proposed in [14]. The Gram polynomial, or discrete Chebyshev polynomial, $t^{N}_{n}(x)$ is a polynomial of degree $n$ in $x$, for $n=0,1,2,\ldots,N-1$, constructed such that two polynomials of unequal degree are orthogonal with respect to the weight function $w(x)=\sum_{r=0}^{N-1}\delta(x-r)$, with $\delta(\cdot)$ being the Dirac delta function. That is, $\int_{-\infty}^{\infty}t^{N}_{n}(x)t^{N}_{m}(x)w(x)\,dx=0\quad\text{if}\quad n\neq m.$ When splines are replaced by Gram polynomials, KAN convolutions are defined as follows.

	$\displaystyle\varphi=w_{b}\cdot b(x)+\tilde{\varphi}(x),$
	$\displaystyle\tilde{\varphi}(x)=\sum_{i=0}^{N+1}w_{i}\cdot t^{i}_{n}(x)$
	$\displaystyle b(x)=SiLU(x)=x/(1+e^{-x})$

This reformulation, on the one hand, enables parameter-efficient fine-tuning of the pre-trained model and, on the other hand, reduces the number of trainable parameters.
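Because the weight function is a sum of Dirac deltas at $r=0,\ldots,N-1$, the orthogonality integral reduces to a finite sum over those grid points. The sketch below checks this numerically. It builds unnormalized discrete orthogonal polynomials by Gram-Schmidt on the monomials, which is an assumption made for illustration and not necessarily the recurrence used in [14] or in the repository; the function names are hypothetical.

```python
from fractions import Fraction


def dot(u, v):
    """Discrete inner product: sum over the grid points r = 0..N-1."""
    return sum(a * b for a, b in zip(u, v))


def gram_polynomial_table(N):
    """Return table[n][r], the value of the degree-n polynomial at r.

    Polynomials are obtained by Gram-Schmidt orthogonalization of
    1, x, x^2, ... sampled on the grid 0..N-1 (unnormalized).
    Exact rational arithmetic keeps the orthogonality check exact.
    """
    points = [Fraction(r) for r in range(N)]
    table = []
    for n in range(N):
        v = [p ** n for p in points]  # monomial x^n on the grid
        for u in table:
            coef = dot(v, u) / dot(u, u)
            v = [vi - coef * ui for vi, ui in zip(v, u)]
        table.append(v)
    return table


def is_orthogonal(table):
    """True if all polynomials of unequal degree are orthogonal on the grid."""
    return all(
        dot(table[n], table[m]) == 0
        for n in range(len(table))
        for m in range(len(table))
        if n != m
    )
```

For `N = 5`, the degree-1 polynomial evaluates to `-2, -1, 0, 1, 2` on the grid, that is, the centered coordinate `x - 2`.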

3.2 Bottleneck Kolmogorov-Arnold Convolutions

Figure 1: KAN Convolution (left) and Bottleneck KAN Convolution (right). The main difference between these two types of layers is the pair of encoder-decoder convolutional layers in the right data stream.

The main problem with KAN Convolutions lies in the $Spline$ part of the model. Whatever type of basis function one chooses, the basis introduces a lot of parameters to the model, which leads to higher resource requirements during training and increases the probability of overfitting. To overcome those issues, we propose to use Bottleneck Kolmogorov-Arnold Convolutions (see Fig.1).

We propose to apply a squeezing convolution with kernel size 1 before the basis function and an expanding convolution with kernel size 1 after it. Intuitively, this can be viewed as a one-layer encoder that extracts meaningful features from the input before it is processed by the chosen basis, followed by a one-layer decoder that maps the result back. The residual activation helps to preserve necessary details that could be lost during encoding and decoding of the input.

We compare the number of trainable parameters across various layer types (see Fig.4). The reduction parameter in the Bottleneck KAGN Conv 2D layer controls the ratio of $d_{x}$ to $d_{\varphi}$: $reduction=d_{x}/d_{\varphi}$. This parameter determines the efficiency and scalability of the model: by adjusting it, we can balance model complexity against performance.
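To make the effect of the reduction parameter concrete, the following back-of-the-envelope sketch counts convolution weights for a plain and a bottleneck layer. The counting rules are assumptions made for illustration: biases, normalization parameters, and any basis-specific parameters are ignored, the basis is assumed to produce `degree + 1` feature maps per channel, and the basis convolution in the bottleneck is assumed to map $d_{\varphi}$ channels to $d_{\varphi}$ channels. The numbers are therefore not expected to reproduce Fig.4 exactly.

```python
def kagn_conv_params(c_in, c_out, k, degree):
    """Approximate weight count of a polynomial KAN convolution.

    Assumes one k x k convolution for the residual branch and one
    k x k convolution over (degree + 1) basis maps per input channel.
    """
    base = c_in * c_out * k * k
    poly = c_in * (degree + 1) * c_out * k * k
    return base + poly


def bottleneck_kagn_conv_params(c_in, c_out, k, degree, reduction):
    """Approximate weight count of the bottleneck variant.

    Assumes d_phi = c_in // reduction, a 1x1 squeeze (c_in -> d_phi),
    a k x k basis convolution (d_phi -> d_phi), a 1x1 expand
    (d_phi -> c_out), and an unchanged residual branch.
    """
    d_phi = max(1, c_in // reduction)
    base = c_in * c_out * k * k
    squeeze = c_in * d_phi
    poly = d_phi * (degree + 1) * d_phi * k * k
    expand = d_phi * c_out
    return base + squeeze + poly + expand
```

Under these assumptions, a layer with 256 input and output channels, a 3x3 kernel, and degree 3 has 2,949,120 weights in the plain form and 770,048 with `reduction=4`; most of the remaining weights then sit in the residual branch.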

Such design allows us to implement a mixture of experts effectively. We utilize sparsely-gated mixture-of-experts [15] in-between the encoder and decoder convolutions, with a set of $\tilde{\varphi_{i}}(x)$ as experts (see Fig.2).
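The routing step of such a layer can be sketched as follows. This is a simplified illustration of top-k gating, not the exact mechanism of [15] or of the repository: the noise term and load-balancing loss of sparsely-gated mixtures are omitted, the gate logits are passed in directly instead of being computed from the input by a learned gate, and each expert is an arbitrary scalar callable standing in for one $\tilde{\varphi_{i}}(x)$.

```python
import math


def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other gates are zero."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def moe_forward(x, experts, logits, k):
    """Weighted sum of the outputs of the k selected experts.

    Experts with a zero gate are never evaluated, which is where the
    computational saving of sparse gating comes from.
    """
    gates = top_k_gate(logits, k)
    return sum(g * expert(x) for g, expert in zip(gates, experts) if g > 0.0)
```

With 2 active experts out of 8, only a quarter of the basis branches are evaluated per input in this scheme, while all 8 contribute trainable parameters.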

3.3 Kolmogorov-Arnold Self-Attention and Focal Modulation

In [16], a self-attention layer was introduced in which the K, Q, and V projections are computed by convolutional layers instead of linear projections. In this paper, we propose a similar approach to construct Self-KAGtention layers by substituting traditional convolutions with KAN convolutional layers. Given that the self-attention operation requires $O(n^{2})$ memory, where $n$ is the number of tokens or tensor pixels in convolutional models, we suggest using optional bottleneck convolutions with a $1\times 1$ kernel and placing the self-attention layer between these two convolutions. We call this version Self-KAGNtention.

Yang et al. [17] introduce the Focal Modulation layer to serve as a seamless replacement for the Self-Attention layer. The layer boasts high interpretability, making it a valuable tool for deep learning practitioners. Here we propose Focal KAGN Modulation, where all convolutional layers from the original focal modulation are replaced with KAN convolutional layers. Since the Focal Modulation layer uses grouped convolutions in the hierarchical contextualization stream, with the number of groups equal to the number of filters, we note that in this case Bottleneck KAN Convolutions should be replaced by KAN Convolutions.

3.4 Regularization in Kolmogorov-Arnold Convolutional Layers

Applying regularization techniques to Kolmogorov-Arnold convolutional layers involves straightforward weight and activation penalties. However, dropout requires careful consideration of its placement. Consider the polynomial version of the Kolmogorov-Arnold convolutional layer. Instead of computing splines over x, we can use Gram, Chebyshev, Legendre, or other polynomials. In this case, we first compute the polynomial basis over x and then perform a weighted sum of the basis values (Fig.3). This gives three possible options for dropout placement: before the layer (we will refer to this position as "Full"), before the polynomial basis calculation (we will refer to this position as "Poly"), and before the weights are applied to the polynomials (we will refer to this position as "Degree").
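The three placements can be made concrete with a scalar sketch of a polynomial KAN unit. It is an illustration under assumptions, not the repository code: `poly_kan_unit` is a hypothetical name, Chebyshev polynomials are used only because their recurrence is short, "Full" is read as perturbing the input of both the residual and the polynomial branch, and `drop` is any callable (stochastic dropout in practice).

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def chebyshev_basis(x, degree):
    """T_0..T_degree at x via T_n = 2 x T_{n-1} - T_{n-2}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def poly_kan_unit(x, w_b, w_poly, placement=None, drop=lambda v: v):
    """phi(x) = w_b * SiLU(x) + sum_i w_i * T_i(x) with optional dropout.

    placement: None, "Full", "Poly", or "Degree".
    """
    if placement == "Full":
        x = drop(x)  # whole layer input, both branches see it
    x_poly = drop(x) if placement == "Poly" else x  # basis input only
    basis = chebyshev_basis(x_poly, len(w_poly) - 1)
    if placement == "Degree":
        basis = [drop(t) for t in basis]  # computed basis values
    return w_b * silu(x) + sum(w * t for w, t in zip(w_poly, basis))
```

Passing a `drop` that always returns zero shows which terms each placement can remove: "Full" zeroes the input of both branches, "Poly" leaves the residual branch intact, and "Degree" removes individual polynomial terms, including the constant one.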

The authors of [7] state several benefits of KANs over MLPs, one of which is robustness to noise and adversarial attacks. From this observation, we can derive an alternative regularization method. Instead of zeroing out some fraction of neurons, we can add Gaussian noise to a layer’s input, thus forcing the model to filter this noise and be more robust against noise in unseen data. Similar to dropout, there are three placements available: "Full", "Poly", and "Degree". More formally, for a given neuron $y_{i}=f(w_{i}\cdot x_{i})$, during noise injection, the neuron output is:

$\displaystyle y_{i}^{\prime}=\begin{cases}y_{i}+\alpha\varepsilon_{i},\quad\varepsilon_{i}\sim N(0,\sigma^{2}(y))&\text{with probability }p\\ y_{i}&\text{with probability }1-p\end{cases}$

In this equation, $\alpha$ is a parameter that controls the amount of added noise, and $\sigma^{2}(y)$ is the variance of the input, computed for each input channel.
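A minimal sketch of this noise injection rule is given below, assuming each channel is a flat list of values and the variance is the population variance over that channel; both choices, as well as the function names, are assumptions made for illustration.

```python
import random


def channel_variance(channel):
    """Population variance of one channel."""
    mean = sum(channel) / len(channel)
    return sum((v - mean) ** 2 for v in channel) / len(channel)


def noise_injection(channels, p, alpha, rng):
    """Apply additive Gaussian noise element-wise with probability p.

    channels : list of flat lists, one per input channel
    p        : probability that an element receives noise
    alpha    : scale of the added noise
    rng      : random.Random instance, passed in for reproducibility
    """
    out = []
    for channel in channels:
        sigma = channel_variance(channel) ** 0.5  # per-channel std
        out.append([
            v + alpha * rng.gauss(0.0, sigma) if rng.random() < p else v
            for v in channel
        ])
    return out
```

Because the noise scale follows the per-channel variance, a constant channel is left unchanged, and `p = 0` or `alpha = 0` recovers the identity.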

3.5 Parameter Efficient Finetuning

Let’s assume we have a pre-trained model $L$ with Gram KAN convolution layers and we want to fine-tune this model on a downstream task, e.g. a classification dataset $D=\{x_{i},y_{i}\}_{i=1}^{T}$. Let $G_{i}(w^{i}_{b},w^{i}_{g})$ be the $i$-th layer of the model $L$, where $w^{i}_{b}$ are the weights of the residual activation and $w^{i}_{g}=\{w^{i}_{j}\}_{j=0}^{N+1}$ are the weights of the Gram polynomials (see the formulas in Section 3.1). Assume that model $L$ has $M$ Gram KAN convolutional layers.

Then we have several options for parameter-efficient fine-tuning for a downstream task.

• Only $\tilde{w}_{g}=\{w^{i}_{j}\}_{j=R}^{N+1},R>1$ are trainable parameters, while all other parameters of $G_{i}(w^{i}_{b},w^{i}_{g})$ for all $i$ are frozen.
• New parameter $\tilde{w^{i}}_{N+2}$ could be introduced and the Gram basis will be expanded by one (or more) polynomial of higher degree.

Then we can formalize the Parameter Efficient Finetuning algorithm for Gram KAN convolutional models as follows.

$\displaystyle\mathcal{C}(L(x_{t}),y_{t})\xrightarrow[\{w^{i}_{j}\}_{j=R_{l}}^{R_{u}},\,i=1\dots M]{t=1\dots T}\min$

Here $\mathcal{C}$ is a loss function, and $R_{l}$ and $R_{u}$ define the minimum and maximum degrees of Gram polynomials, respectively, at which the corresponding weights will be updated. When $R_{u}>N+1$, we add additional polynomials to the basis and initialize their weights with zeros.

These fine-tuning options substantially reduce the number of trainable parameters. Intuitively, one could consider this method as a refinement of high-order features to match a new data distribution. Empirical evaluation of this method is presented in Section 4.6.
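The weight selection step of this algorithm can be sketched as follows. The sketch is hypothetical in its data layout: each layer is represented by a plain list with one scalar weight per polynomial degree (in a real model each entry would be a weight tensor), and `select_trainable_gram_weights` is an illustrative name, not a function from the repository.

```python
def select_trainable_gram_weights(layer_weights, r_lower, r_upper):
    """Pick the Gram weights to update during fine-tuning.

    layer_weights : list over layers; each item lists the weights
                    w_0 .. w_{N+1}, indexed by polynomial degree
    r_lower       : lowest degree whose weight is trainable (R_l)
    r_upper       : highest degree whose weight is trainable (R_u)

    Returns (new_weights, trainable_masks). If r_upper exceeds the
    pre-trained maximum degree, zero-initialized weights are appended.
    All weights outside [r_lower, r_upper] stay frozen.
    """
    new_weights, masks = [], []
    for weights in layer_weights:
        extended = list(weights)
        while len(extended) <= r_upper:
            extended.append(0.0)  # new higher-degree polynomial, zero init
        masks.append([r_lower <= j <= r_upper for j in range(len(extended))])
        new_weights.append(extended)
    return new_weights, masks
```

Zero initialization of the added degrees means the extended model initially computes the same function as the pre-trained one, so fine-tuning starts from the pre-trained behavior.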

4 Experiments

In this section, we present the following experiments. Section 4.1 includes experiments on the MNIST [18], CIFAR10, and CIFAR100 [19] datasets with different formalizations of KAN convolutions. Section 4.2 presents experiments with various regularizations and hyperparameter optimization. Section 4.3 provides results for Bottleneck convolutions on CIFAR100 and Tiny ImageNet [20] and also considers ResNet-like and DenseNet-like architectures. Section 4.4 presents results of self-attention layers experiments. Section 4.5 presents results for ImageNet1k dataset, and Section 4.6 presents results of parameter-efficient finetuning. Section 4.7 presents results for segmentation tasks.

4.1 Baseline on MNIST, CIFAR10 and CIFAR100

Baseline models were chosen to be simple networks with 4 and 8 convolutional layers. To reduce the feature’s spatial dimensions, convolutions with dilation=2 were used. In the 4-layer model, the second and third convolutions had dilation=2, while in the 8-layer model, the second, third, and sixth convolutions had dilation=2.

The number of channels in the convolutions was the same for all models.

• For 4 layers: 32, 64, 128, 512
• For 8 layers: 2, 64, 128, 512, 1024, 1024, 1024, 1024

After the convolutions, Global Average Pooling was applied, followed by a linear output layer.

In the case of classic convolutions, a traditional structure was used: convolution - batch normalization - ReLU. In the case of KAN convolutions, batch normalization and SiLU are applied after the KAN convolution layer. All experiments were conducted on an NVIDIA RTX 3090 with identical training parameters.

In this section, we investigate the performance of several KAN options: spline-based (KANConv), RBF-based (FastKANConv), Legendre polynomials (KALNConv), Chebyshev polynomials (KACNConv), Gram polynomials (KAGNConv), and wavelets (WavKANConv).

Model	MNIST Val. Accuracy, %	MNIST Params., M	MNIST Eval Time, s	CIFAR10 Val. Accuracy, %	CIFAR10 Params., M	CIFAR10 Eval Time, s	CIFAR100 Val. Accuracy, %	CIFAR100 Params., M	CIFAR100 Eval Time, s
Conv, 4 layers, baseline	99.42	0.1	0.7008	73.18	0.1	1.8321	42.29	0.12	1.5994
KANConv, 4 layers	99.00	3.49	2.6401	52.08	3.49	3.7972	21.78	3.52	4.0262
FastKANConv, 4 layers	97.65	3.49	1.5999	64.95	3.49	2.3716	34.32	3.52	2.7457
KALNConv, 4 layers	84.85	1.94	1.7205	10.28	1.94	3.0527	5.97	1.97	3.0919
KACNConv, 4 layers	97.62	3.92	1.6710	52.01	3.92	2.3972	23.17	0.42	2.6522
KAGNConv, 4 layers	99.49	0.49	1.7253	65.84	0.49	2.2570	47.36	1.97	2.3399
WavKANConv, 4 layers	99.23	0.95	7.4622	73.63	0.95	11.2276	41.50	0.98	11.4744
Conv, 8 layers, baseline	99.63	1.14	1.2061	83.05	1.14	1.8258	57.52	1.19	1.8265
KANConv, 8 layers	99.37	40.7	4.2011	74.66	40.7	5.4858	36.18	40.74	5.7067
FastKANConv, 8 layers	99.49	40.7	2.1653	74.66	40.7	5.4858	43.32	40.74	2.7771
KALNConv, 8 layers	49.97	22.61	1.7815	15.97	22.61	2.7348	1.74	22.65	2.6863
KACNConv, 8 layers	99.32	18.09	1.6973	62.14	18.09	2.3459	25.01	18.14	2.3826
KAGNConv, 8 layers	99.68	22.61	2.2402	84.14	22.61	2.5849	59.27	22.66	2.6460
WavKANConv, 8 layers	99.57	10.73	59.1734	85.37	10.73	28.0385	55.43	10.78	30.5438

Table 1: Results on MNIST, CIFAR10, and CIFAR100 datasets

As we can see from Table 1, the Gram polynomial-based and wavelet-based versions perform better than the other KAN options and, in several configurations, outperform vanilla convolutions. Due to the higher computational resource requirements of wavelet-based KANs, we will focus on Gram KANs as the main basis function option in further research.

4.2 Regularization study and hyperparameters optimization

4.2.1 Regularization study

The baseline model was chosen to be a simple network with 8 convolutional layers with Gram polynomials as basis functions. To reduce the feature’s spatial dimensions, convolutions with dilation=2 were used: the second, third, and sixth convolutions had dilation=2.

We explore two sets of convolutional layer filters:

• Slim: 16, 32, 64, 128, 256, 256, 512, 512
• Wide: 32, 64, 128, 256, 512, 512, 1024, 1024

After the convolutions, Global Average Pooling was applied, followed by a linear output layer. In these experiments, no dropout was used after Global Average Pooling.

All experiments were conducted on an NVIDIA RTX 3090 with identical training parameters. As part of this research, we aim to find answers to the following questions:

• What is the impact of $L_{1}$ and $L_{2}$ activation penalties of ConvKAN layers on the model?
• What is the impact of $L_{1}$ regularization of the weights of ConvKAN layers on the model?
• How does the dropout placement within a ConvKAN layer impact the model? Essentially, there are three placements: before the layer (Full), before the calculation of the polynomial basis (Poly), and before applying the weights to the calculated polynomials (Degree).
• Since KAN models are supposed to be more robust to noise in the data, can we replace the Dropout layer with additive Gaussian noise as a regularization technique?

Regularization	Slim Train Accuracy	Slim Val. Accuracy	Wide Train Accuracy	Wide Val. Accuracy
None	73.73	61.30	86.88	67.04
$L_{1}$ Activation, 1e-08	74.49	61.20	86.69	66.41
$L_{1}$ Activation, 1e-07	74.64	61.41	86.60	66.88
$L_{1}$ Activation, 1e-06	74.11	61.74	86.73	67.09
$L_{2}$ Activation, 1e-08	73.85	61.36	86.61	67.49
$L_{2}$ Activation, 1e-07	73.96	61.55	86.78	66.81
$L_{2}$ Activation, 1e-06	73.10	60.29	86.99	66.30
$L_{1}$ Weight, 1e-08	74.31	61.39	86.41	66.79
$L_{1}$ Weight, 1e-07	73.48	61.08	86.41	66.79
$L_{1}$ Weight, 1e-06	73.60	60.96	87.10	66.83
Poly Dropout, 0.05	68.69	60.98	84.97	66.83
Full Dropout, 0.05	60.98	63.64	79.53	68.93
Degree Dropout, 0.05	70.12	62.91	85.01	67.75
Poly Dropout, 0.15	65.43	58.63	82.07	62.26
Full Dropout, 0.15	43.77	56.10	63.19	66.38
Degree Dropout, 0.15	64.37	62.28	81.24	68.70
Poly Dropout, 0.25	61.93	51.27	80.22	59.70
Full Dropout, 0.25	32.99	47.65	49.49	59.56
Degree Dropout, 0.25	60.09	60.69	77.46	67.85
Poly Noise Injection, 0.05	70.42	62.06	84.65	67.11
Full Noise Injection, 0.05	59.75	62.63	79.71	69.18
Degree Noise Injection, 0.05	69.45	62.40	85.00	68.43
Poly Noise Injection, 0.15	65.75	57.64	82.78	63.82
Full Noise Injection, 0.15	43.78	56.15	63.30	66.07
Degree Noise Injection, 0.15	64.52	61.78	81.33	68.23
Poly Noise Injection, 0.25	63.38	53.14	80.74	59.96
Full Noise Injection, 0.25	33.54	48.24	49.58	59.70
Degree Noise Injection, 0.25	60.17	60.98	77.26	67.82

Table 2: Regularization study on CIFAR10 dataset

Based on the conducted experiments and results presented in Table 2, the following conclusions can be drawn:

• It seems that Full Dropout and Full Noise Injection are the best options for regularization, helping to combat model overfitting.
• $L_{1}$ / $L_{2}$ activation penalties, as well as $L_{1}$ weight penalty, slightly improve the situation, but not significantly.
• In a wider model, the impact of regularization is greater.

4.2.2 Scaling KANs

In classical convolutional networks, we have two major options for scaling up models: we can go deeper and stack more layers, or we can go wider and expand the number of convolutional filters. There are other ways to scale up models, like leveraging a mixture of experts.

KAN Convs with Gram polynomials as basis functions provide us with another possibility for scaling: instead of inflating channel numbers or adding new layers, we could increase the degree of polynomials.

During the experiments, we used the same augmentations as in the previous section, NoiseInjection in Full positions for regularization with p = 0.05 and linear dropout with p = 0.05.

The baseline models were chosen to be simple networks with 8, 12, and 16 convolutional layers with Gram polynomials as basis functions. To reduce the feature’s spatial dimensions, convolutions with dilation=2 were used: the second, third, and sixth convolutions had dilation=2 for the 8- and 12-layer models, and the second, fourth, and eighth for the 16-layer model.

The models have the following sets of convolutional layer filters with a width scale equal to 1:

• 8 layers: 16, 32, 64, 128, 256, 256, 512, 512
• 12 layers: 16, 32, 64, 128, 256, 256, 256, 256, 256, 512, 512, 512
• 16 layers: 16, 16, 32, 32, 64, 64, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512

Our findings indicate that scaling model width yields better performance than increasing model depth. Additionally, we observe no benefit from increasing the degree of the Gram polynomials. However, it is noteworthy that the number of trainable parameters increases rapidly in any configuration, whether depth, width, or degree is scaled. The lack of benefits from depth and degree scaling may be attributed to the relatively small dataset size. It is plausible that larger datasets could produce different outcomes.

4.2.3 Hyperparameters tuning

To identify an optimal set of hyperparameters and mitigate the risk of overfitting to the test set, we partitioned the CIFAR100 training dataset into new training and validation sets in an 80/20 ratio. Following the completion of the hyperparameter search, we trained the model on the entire CIFAR100 training set and evaluated it on the complete test set. This study aims to determine effective hyperparameters for eight-layer models. The search space and optimal parameters were established after 50 optimization runs, with the best parameters achieving an accuracy of 61.85

• $L_{1}$ activation penalty, optimal value $10^{-7}$
• $L_{2}$ activation penalty, optimal value $10^{-6}$
• $L_{1}$ weight decay, optimal value $0$
• Dropout rate before the output layer, optimal value $0.1456351951990277$
• Dropout/Noise Injection placement, optimal value: Full
• “Dropout type”: use either Dropout or Noise Injection inside ConvKAGN layers, optimal value: Noise Injection
• Width scale: parameter that expands the number of channels, optimal value $6$
• Degree of Gram polynomials, optimal value $3$
• Adam weight decay, optimal value $6.579785489783022\cdot 10^{-6}$
• Learning rate, optimal value $0.000779538356958937$
• Learning rate power: parameter that controls the learning rate scheduler, optimal value $1.1275350538654738$
• Label smoothing, optimal value $0.1823706816166831$
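For reuse in later experiments, the optimal values listed above can be kept in a single configuration mapping. The key names below are hypothetical and chosen only for readability; the values are copied from the list.

```python
# Optimal hyperparameters from the search; key names are illustrative only.
OPTIMAL_HPARAMS = {
    "l1_activation_penalty": 1e-7,
    "l2_activation_penalty": 1e-6,
    "l1_weight_decay": 0.0,
    "dropout_before_output": 0.1456351951990277,
    "dropout_placement": "Full",
    "dropout_type": "Noise Injection",
    "width_scale": 6,
    "gram_degree": 3,
    "adam_weight_decay": 6.579785489783022e-6,
    "learning_rate": 0.000779538356958937,
    "learning_rate_power": 1.1275350538654738,
    "label_smoothing": 0.1823706816166831,
}
```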

The results of the models on this parameter set are presented in Table 3 and Fig. 6.

The model trained with optimal parameters significantly outperforms the default configuration. Notably, increasing model depth adversely affects performance, which may be due to the extreme model size (nearly half a billion parameters) or issues related to vanishing gradients in very deep networks. However, several established methods, such as ResNet-like and DenseNet-like architectures, can mitigate these issues. The primary challenge remains managing the overwhelming number of parameters in these models. Addressing this challenge is essential for improving scalability and performance.

4.3 Bottleneck Kolmogorov-Arnold Convolutions on CIFAR100 and Tiny ImageNet

4.3.1 CIFAR100

In this section, we conduct a series of experiments on the CIFAR-100 dataset using Bottleneck KAGN convolutional layers. The training parameters are based on the hyperparameter optimization results discussed in Section 4.2.3, and all models were trained for 200 full epochs. We use the simple models with 8, 12, and 16 layers described in Section 4.2.2; a tiny DenseNet model, as described in [21], with all convolutional layers replaced by Bottleneck KAGN convolutions; and ResNet-like models, also with convolutional layers replaced by Bottleneck KAGN or KAGN convolutional layers. This approach aims to evaluate the performance and scalability of the Bottleneck KAGN layers across different architectures. We also provide results for the Mixture of Experts versions of Bottleneck KAGN convolutional layers, described in Section 3.2, with 2 active experts out of 8.
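To illustrate what "2 active experts out of 8" means, the following is a minimal sketch of top-k gating in the spirit of the sparsely-gated Mixture of Experts layer [15]. It is not the paper's implementation: the noise term and load-balancing loss are omitted, the experts are arbitrary callables on scalars, and the names `top_k_gate` and `moe_forward` are hypothetical.

```python
import math


def top_k_gate(logits, k=2):
    """Return {expert_index: weight} for the k largest gate logits.

    The weights are a softmax computed over the selected logits only,
    so every other expert receives exactly zero weight and is not evaluated.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    shift = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - shift) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, logits, k=2):
    """Combine the outputs of the k selected experts with their gate weights."""
    gates = top_k_gate(logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())
```

Only the selected experts are called in `moe_forward`, which is why the layer can grow in width without a proportional increase in compute per sample.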

Tables 3 and 4 demonstrate that models using Bottleneck KAGN convolutions perform comparably to those using standard KAGN convolutions but with significantly fewer parameters. However, increasing the number of layers generally degrades model quality. The ResNet18 and ResNet34 variants illustrate that incorporating residual connections mitigates this issue, with ResNet34 outperforming ResNet18 for both KAGN and Bottleneck KAGN convolutions. This indicates that adopting ResNet-like and DenseNet-like approaches can effectively address the challenge of scaling models in depth.

Model	Accuracy, %
Dspike ResNet-18 [22]	74.24
ELU ResNet [4]	73.5
OTTT [23]	71.05
WaveMix-Lite-256/7 [24]	70.20
IM-Loss (VGG-16) [25]	70.18
ResNet18 (modified) [26]	66
ResKAGNet34 (Ours)	78.14
ResKAGNet34 Bottleneck (Ours)	77.11
ResKAGNet34 Bottleneck MoE (Ours)	71.66
ResKAGNet18 (Ours)	76.56
ResKAGNet18 Bottleneck (Ours)	76.43
ResKAGNet18 Bottleneck MoE (Ours)	71.64
Tiny DenseKAGNet (Ours)	77.77
DenseKAGNet BottleNeck (Ours)	76.90
DenseKAGNet BottleNeck MoE (Ours)	76.23
16 layers, BottleNeck MoE (Ours)	69.53
16 layers, BottleNeck (Ours)	70.65
12 layers, BottleNeck MoE (Ours)	71.65
12 layers, BottleNeck (Ours)	72.41
8 layers, BottleNeck MoE (Ours)	74.39
8 layers, BottleNeck (Ours)	74.83

Table 4: BottleneckKAGN Convolutional networks, CIFAR 100

Model	Accuracy
Tiny DenseNet [21]	60.0
ResNet-18 (AutoMix) [27]	67.33
ResNeXt-50(AutoMix) [27]	70.72
Tiny DenseKAGNet (Ours)	66.07
Tiny DenseKAGNet, BottleNeck (Ours)	65.9
Tiny DenseKAGNet, BottleNeck MoE (Ours)	64.82
VGG19-like, MoE (Ours)	63.72
VGG19-like (Ours)	48.59
VGG16-like, MoE (Ours)	64.27
VGG16-like (Ours)	52.55
VGG13-like, MoE (Ours)	64.57
VGG13-like (Ours)	52.57
VGG11-like, MoE (Ours)	63.22
VGG11-like (Ours)	51.22

Table 5: BottleneckKAGN Convolutional networks, Tiny Imagenet

4.3.2 Tiny ImageNet

In this section, we conduct a series of experiments on the Tiny ImageNet dataset. The training parameters are based on the hyperparameter optimization results discussed in Section 4.2.3, and all models were trained for 200 full epochs. We employ VGG-like models in which the hidden linear layers and the final MaxPool layer are replaced with Global Average Pooling and a single output layer. Additionally, we use a tiny DenseNet model, as described in [21], with all convolutional layers replaced by Bottleneck KAGN convolutions. We also perform experiments with Mixture of Experts (MoE) models featuring two active experts out of a total of eight.

The results, presented in Table 5, indicate that plain VGG-like models do not benefit from added depth: accuracy changes little from the VGG11-like to the VGG16-like model and drops for the VGG19-like model. In contrast, the MoE versions show significantly better performance on this dataset, suggesting that MoE layers can be an effective approach for scaling KAGN-based models in width.

In addition, the proposed VGG-like MoE and Tiny DenseKAGNet models with Gram polynomial KAN convolutions outperform Tiny DenseNet [21] by a significant margin on the Tiny ImageNet dataset.

4.4 Self-KAGNtention

In this section, we empirically investigate the performance of BottleNeck SelfKAGNtention layers and BottleNeckKAGN Focal Modulation. We base our experiments on an architecture with eight convolutional layers, supplemented with three Self-Attention or Focal Modulation layers. The network (see Figs. 7 and 8) begins with two BottleNeckKAGN convolution layers, each with 32 filters (256 filters in the Fixed version), both using a 3x3 kernel. This is followed by an attention layer. The next section consists of three BottleNeckKAGN convolution layers, each with 128 filters (256 filters in the Fixed version) and a 3x3 kernel, with the first layer using a stride of 2. Another attention layer is added after these layers. The subsequent section includes three BottleNeckKAGN convolution layers with 256 filters and a 3x3 kernel, with the first layer using a stride of 2. A third attention layer follows. The network concludes with a global max pooling layer and an output layer with 100 nodes.
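The layer sequence described above can be written down as a plain layer plan, which makes the difference between the regular and the Fixed version explicit. This is only a sketch: the tuple layout and layer names are hypothetical, and the width scale is assumed to multiply the filter counts.

```python
def attention_net_plan(fixed=False, width_scale=1):
    """Return a list of (layer_type, channels, stride) tuples.

    Three stages of 2, 3, and 3 BottleNeckKAGN 3x3 convolutions, each stage
    followed by one attention (Self-Attention or Focal Modulation) layer.
    """
    stage_filters = (256, 256, 256) if fixed else (32, 128, 256)
    stage_depths = (2, 3, 3)
    plan = []
    for stage, (filters, depth) in enumerate(zip(stage_filters, stage_depths)):
        channels = filters * width_scale  # assumption: multiplicative width scale
        for layer in range(depth):
            # The first convolution of the second and third stages uses stride 2.
            stride = 2 if (stage > 0 and layer == 0) else 1
            plan.append(("bottleneck_kagn_conv3x3", channels, stride))
        plan.append(("attention", channels, 1))
    plan.append(("global_max_pool", None, None))
    plan.append(("linear", 100, None))  # 100 output nodes for CIFAR-100
    return plan
```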

We use the training and regularization parameters found in Section 4.2.3, categorical cross-entropy as the loss function, and the AdamW optimizer [28]. The models were trained for 150 epochs with a batch size of 256. For Self-Attention layers, we employ bottleneck convolutions with 16 and 64 filters in the first two attention layers to reduce memory and computational resource usage.

We explore two model versions: one with an increasing number of filters with depth and another with a constant number of filters, inspired by transformer architectures. The experimental results on CIFAR-100, presented in Table 6 and compared with Table 3, show that the best combinations of attention mechanisms with bottleneck convolutions outperform models without attention while maintaining a significantly lower number of parameters.

Model	Width scale	Accuracy, CIFAR-100	Parameters, M
SelfKAGNtention	1	71.4	4.9
Fixed SelfKAGNtention	1	77.82	8.7
FocalKAGNtention	1	71.24	4.3
Fixed FocalKAGNtention	1	75.99	7.4
SelfKAGNtention	3	77.69	43.7
Fixed SelfKAGNtention	3	69.72	78.3
FocalKAGNtention	3	76.46	37.8
Fixed FocalKAGNtention	3	74.67	63.2

Table 6: BottleneckKAGN Convolutional with Attention layers, CIFAR-100 dataset

4.5 ImageNet1k

In this section, we provide results on the ImageNet1k [29] dataset. We tested several VGG-like [30] models with several modifications. We replaced the last MaxPooling layer and the two hidden fully connected layers with GlobalAveragePooling and a single fully connected output layer. We also added two extra convolutional layers at the end of the encoder.

The model consists of 10 consecutive Gram ConvKAN layers or Bottleneck Gram ConvKAN layers with BatchNorm and polynomial degree 5, followed by GlobalAveragePooling and a linear classification head (see Figs. 9 and 10). The network design starts with a KAGN convolution layer with 32 filters (3x3), followed by a max pooling layer (2x2). This pattern is repeated with a KAGN convolution layer with 64 filters (3x3) and another max pooling layer (2x2). The network then includes two consecutive KAGN convolution layers with 128 filters (3x3), followed by another max pooling layer (2x2). Next, there are two KAGN convolution layers with 256 filters (3x3), another max pooling layer (2x2), and two more KAGN convolution layers (3x3) with 256 filters for the V2 version or 512 filters for the V4 version. The network concludes with a global average pooling layer and a dense output layer with 1000 nodes.

We also tested a model with a Self KAN-attention layer, described in Section 3.3, placed before Global Average Pooling.

All models except one were trained for 200 full epochs with the AdamW optimizer, using a learning rate of 0.0009, $\beta_{1}=0.9$, $\beta_{2}=0.999$, a weight decay of $5\cdot 10^{-6}$, 7500 warmup steps, a polynomial learning rate scheduler with power $0.3$ and a final learning rate of $10^{-7}$, and a batch size of 32. The model marked with the "Opt params" tag was trained with the parameters discovered in Section 4.2.3 and a polynomial degree of 3. The models marked with "BottleNeck" use BottleNeck convolutional layers, described in Section 3.2.
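The learning rate schedule described above can be sketched as follows. The text does not state the shape of the warmup, so a linear ramp from zero is assumed here; the decay follows the usual polynomial form $\mathrm{lr}(t)=(\mathrm{lr}_{0}-\mathrm{lr}_{\mathrm{end}})\,(1-t/T)^{p}+\mathrm{lr}_{\mathrm{end}}$, where $t$ counts steps after warmup and $T$ is the number of decay steps. The function name and the `total_steps` argument are hypothetical.

```python
def polynomial_lr(step, total_steps, base_lr=9e-4, end_lr=1e-7,
                  power=0.3, warmup_steps=7500):
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to base_lr, then polynomial decay
    from base_lr to end_lr over the remaining steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    if step >= total_steps:
        return end_lr
    # Fraction of the decay phase that is still ahead, in (0, 1].
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (base_lr - end_lr) * remaining ** power + end_lr
```

With a power below 1, as in the ImageNet1k runs, the rate stays close to its peak for most of training and drops sharply near the end; a power above 1, as found in Section 4.2.3, decays faster early on.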

The results of the experiments are presented in Table 7. Please note that the metrics for our models are computed on the validation set of ImageNet1k; metrics on the test set will be added later. According to these results, the VGG11-like model with bottleneck KAGN convolutions trained with the optimized parameters outperforms all models from the VGG family as well as ResNet-18, ResNet-34, and DenseNet121. This shows the potential of KAN convolutional models.

Model	Accuracy, Top-1	Accuracy, Top-5
VGG11 [30]	69.02	88.628
VGG11, w BatchNorm [30]	70.37	89.81
VGG13 [30]	69.928	89.246
VGG13, w BatchNorm [30]	71.586	90.374
VGG16 [30]	71.592	90.382
VGG16, w BatchNorm [30]	73.36	91.516
VGG19 [30]	72.376	90.876
VGG19, w BatchNorm [30]	74.218	91.842
ResNet18 [31]	69.758	89.078
ResNet34 [31]	73.314	91.42
ResNet50 [31]	76.13	92.862
DenseNet121 [5]	74.434	91.972
DenseNet161 [5]	77.138	93.56
VGG11-like, v2 (Ours)	59.1	82.29
VGG11-like, v4 (Ours)	61.17	83.26
VGG11-like, v4, BottleNeck (Ours)	68.50	88.46
VGG11-like, v4, BottleNeck, SA (Ours)	70.684	89.462
VGG11-like, v4, BottleNeck, Opt params (Ours)	74.586	92.13

Table 7: Results on the ImageNet 1K dataset. The weights of our models are available on HuggingFace.

4.6 Parameter Efficient Finetuning

In this section, we use the HAM10000 ("Human Against Machine with 10000 training images") dataset [32] to explore the proposed PEFT method for Gram KAN convolution models. The dataset consists of 10015 dermatoscopic images. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: Actinic keratoses and intraepithelial carcinoma / Bowen’s disease, basal cell carcinoma, benign keratosis-like lesions, dermatofibroma, melanoma, melanocytic nevi, and vascular lesions. We use the train, validation, and test split hosted on HuggingFace: https://huggingface.co/datasets/marmal88/skin_cancer.

We selected this dataset to evaluate the PEFT method due to its requirement for model adaptation to a novel domain not encountered during pretraining on ImageNet1k. This characteristic makes it an exemplary dataset for our experiments, providing a scenario closely aligned with real-world use cases.

We evaluate several setups with the same training procedure and parameters (see Table 8). We use the VGG11-like, v2 model pretrained on ImageNet1K. All models were trained for 20 full epochs with the AdamW optimizer, using a learning rate of 0.0001, $\beta_{1}=0.9$, $\beta_{2}=0.999$, a weight decay of $5\cdot 10^{-6}$, 500 warmup steps, a polynomial learning rate scheduler with power $0.3$ and a final learning rate of $10^{-6}$, and a batch size of 12. Because of class imbalance in the dataset, we use Focal Loss [33] instead of categorical cross-entropy.
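For reference, the focal loss of [33] for a single sample is $\mathrm{FL}(p_{t})=-\alpha_{t}(1-p_{t})^{\gamma}\log p_{t}$, where $p_{t}$ is the predicted probability of the true class. The sketch below is a plain-Python illustration; the values of $\gamma$ and $\alpha$ used in our experiments are not stated in this section, so $\gamma=2$ and no class weighting are assumptions.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Focal loss for one sample.

    probs:  predicted class probabilities (assumed to sum to 1)
    target: index of the true class
    gamma:  focusing parameter (assumed value; gamma=0 gives cross-entropy)
    alpha:  optional per-class weights (assumption: unweighted by default)
    """
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)
```

Well-classified samples (large $p_{t}$) are down-weighted by the factor $(1-p_{t})^{\gamma}$, so rare and hard classes contribute relatively more to the gradient.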

Setups with the value "Random weights initialization" in the "Pretrained degree" column use randomly initialized weights with the same configuration as the VGG11-like, v2 model. The "Trainable Degree" column describes the set of trainable parameters $\tilde{w}_{g}=\{w_{i}\}_{i=R}^{N+1},R>1$; a value of 6 means that an extra polynomial was added to the basis. "Full Training" implies that all weights of the encoder were updated during training. The "Finetune activation residual" column indicates whether the $w_{b}$ weights of the residual activation were trainable.

As Table 8 shows, the PEFT setup with trainable weights for the 4th-, 5th-, and extra-degree polynomials outperforms all other setups in accuracy, including full fine-tuning, which shows the potential of the proposed method. Fine-tuning the activation residual weights generally leads to worse performance, so it is safe to assume that these weights can remain frozen during the PEFT procedure.
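The setups in Table 8 differ only in which weight groups receive gradient updates. The helper below is a hypothetical sketch of that selection, not the paper's implementation: it assumes one weight group per polynomial degree from 0 up to the largest degree in use, plus one group for the activation residual weights $w_{b}$, and the group names are invented for illustration.

```python
def peft_trainable_flags(pretrained_degree, trainable_degrees,
                         finetune_residual=False):
    """Map each weight group to True (trainable) or False (frozen).

    Degrees above pretrained_degree correspond to extra polynomials
    added to the basis before fine-tuning.
    """
    max_degree = max(pretrained_degree, max(trainable_degrees))
    flags = {
        f"gram_degree_{d}": d in trainable_degrees
        for d in range(max_degree + 1)
    }
    flags["activation_residual"] = finetune_residual
    return flags
```

For the best-performing row of Table 8, `peft_trainable_flags(5, {4, 5, 6})` marks only the groups for degrees 4, 5, and the added degree 6 as trainable and keeps everything else frozen.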

Training setup			Metrics
Pretrained degree	Trainable Degree	Finetune activation residual	Accuracy	AUC (OvR)	AUC (OvO)	Recall (macro)	Recall (micro)	$F_{1}$ score (macro)	$F_{1}$ score (micro)
5	4, 5, 6	FALSE	83.89	0.975	0.984	0.842	0.839	0.752	0.839
5	4	FALSE	82.72	0.977	0.983	0.838	0.827	0.777	0.827
5	4, 5, 6	TRUE	81.79	0.971	0.983	0.827	0.818	0.714	0.818
5	4, 5	FALSE	80.7	0.977	0.985	0.827	0.807	0.75	0.807
5	5	FALSE	80.16	0.974	0.98	0.802	0.802	0.758	0.802
5	4, 5	TRUE	77.28	0.966	0.974	0.769	0.773	0.68	0.773
5	Full Training	TRUE	76.89	0.976	0.985	0.842	0.769	0.729	0.769
5	5	TRUE	75.8	0.963	0.972	0.783	0.758	0.696	0.758
5	6	FALSE	74.32	0.941	0.94	0.634	0.743	0.636	0.743
5	3	FALSE	74.16	0.945	0.939	0.573	0.742	0.601	0.742
5	4	TRUE	73.07	0.967	0.981	0.84	0.731	0.69	0.731
Random weights initialization	Full Training	TRUE	71.67	0.958	0.976	0.837	0.717	0.67	0.717
5	2	FALSE	71.28	0.927	0.924	0.507	0.713	0.529	0.713
5	6	TRUE	69.18	0.938	0.947	0.69	0.692	0.639	0.692
5	5, 6	FALSE	67.78	0.936	0.938	0.55	0.678	0.54	0.678
5	2	TRUE	66.69	0.928	0.939	0.634	0.667	0.599	0.667
5	5, 6	TRUE	66.69	0.939	0.946	0.577	0.667	0.564	0.667
5	3	TRUE	58.13	0.944	0.955	0.678	0.581	0.622	0.581
5	Fixed encoder	FALSE	48.25	0.864	0.87	0.53	0.482	0.361	0.482

Table 8: PEFT results on HAM10000 dataset
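Table 8 reports both macro- and micro-averaged metrics. As a reminder of the difference, the sketch below computes both recall variants for single-label multiclass predictions: macro recall averages the per-class recalls, so rare classes count as much as frequent ones, while micro recall pools all samples and therefore coincides with accuracy, which is why the two columns agree in the table. The function name is hypothetical.

```python
def recall_scores(y_true, y_pred):
    """Return (macro_recall, micro_recall) for single-label predictions."""
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        per_class.append(hits / support)
    macro = sum(per_class) / len(per_class)
    # Pooling over all samples: correct predictions divided by sample count.
    micro = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return macro, micro
```

On an imbalanced dataset such as HAM10000, a large gap between the two values indicates that the minority classes are recognized much worse than the majority classes.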

4.7 Segmentation

In this section, we provide empirical evaluation for U-Net-like segmentation models based on KAGN Convolutional layers on BUSI, GlaS, and CVC-ClinicDB datasets.

The BUSI dataset [34] consists of ultrasound images depicting normal, benign, and malignant breast cancer cases, along with their corresponding segmentation maps. In our study, we utilized 647 ultrasound images representing both benign and malignant breast tumors, all consistently resized to 256 × 256 pixels. This dataset provides a comprehensive collection of images that assist in detecting and differentiating various types of breast tumors, offering valuable insights for medical professionals and researchers.

The GlaS dataset [35] includes 612 Standard Definition (SD) frames from 31 sequences, each with a resolution of 384 × 288 pixels, collected from 23 patients. This dataset is associated with the Hospital Clinic in Barcelona, Spain. The sequences were recorded using Olympus Q160AL and Q165L devices, paired with an Extra II video processor. For our study, we specifically used 165 images from the GlaS dataset, all resized to 512 × 512 pixels.

The CVC-ClinicDB dataset [36], also known as "CVC," is a publicly accessible resource for polyp diagnosis within colonoscopy videos. It comprises 612 images, each with a resolution of 384 × 288 pixels, extracted from 31 distinct colonoscopy sequences. These frames offer a diverse array of polyp instances, making them particularly useful for developing and evaluating polyp detection algorithms. To ensure consistency across different datasets in our study, all images from the CVC-ClinicDB dataset were uniformly resized to 256 × 256 pixels.

We explore three U-Net-like models: a U-Net with convolutions replaced by KAGN convolutional layers; a $U^{2}$-Net-like model [12], again with convolutions replaced by KAGN convolutional layers; and a small $U^{2}$-Net model, in which the number of filters is the same across all hidden layers and equals 16 multiplied by the width scale parameter.

For the BUSI, GlaS, and CVC datasets, the batch size was set to 4 and the learning rate to 1e-4; all other parameters were the same as discussed in Section 4.2.3. The loss function was chosen to be a combination of binary cross entropy and dice loss. We randomly split each dataset into 80% training and 20% validation subsets.
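A combined binary cross entropy and dice loss can be sketched as below for a flattened mask. The relative weighting of the two terms and the smoothing constant are not specified in the text, so an unweighted sum and `smooth=1.0` are assumptions, and the function name is hypothetical.

```python
import math


def bce_dice_loss(probs, targets, eps=1e-7, smooth=1.0):
    """BCE + (1 - soft Dice) over a flattened mask.

    probs:   predicted foreground probabilities in [0, 1]
    targets: ground-truth labels, 0 or 1
    """
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= len(probs)

    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)
    return bce + (1.0 - dice)
```

The cross entropy term gives a per-pixel gradient, while the dice term measures region overlap and is less sensitive to the foreground/background imbalance typical of medical masks.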

The results are presented in Table 9. On each of the three datasets, our best model outperforms all compared state-of-the-art methods, including the recent U-KAN model [13]. Qualitative results of the models are provided in Fig. 11.

Methods	BUSI [34]		GlaS [35]		CVC [36]
Methods	IoU↑	F1↑	IoU↑	F1↑	IoU↑	F1↑
U-Net [6]	57.22	71.91	86.66	92.79	83.79	91.06
Att-Unet [37]	55.18	70.22	86.84	92.89	84.52	91.46
U-Net++ [10]	57.41	72.11	87.07	92.96	84.61	91.53
U-NeXt [38]	59.06	73.08	84.51	91.55	74.83	85.36
Rolling-UNet [39]	61.00	74.67	86.42	92.63	82.87	90.48
U-Mamba [40]	61.81	75.55	87.01	93.0	84.79	91.63
U-KAN [13]	63.38	76.40	87.64	93.37	85.05	91.88
$U^{2}$ -KAGNet, Small (Ours)	59.44	74.56	89.13	94.25	85.62	92.26
$U^{2}$ -KAGNet, BottleNeck, Small (Ours)	62.95	77.26	85.99	92.47	87.61	93.39
$U^{2}$ -KAGNet, BottleNeck (Ours)	55.91	71.72	87.73	93.47	88.86	94.1
$U^{2}$ -KAGNet (Ours)	58.13	73.52	88.14	93.7	87.07	93.09
UKAGNet (Ours)	63.45	77.64	87.31	93.23	76.85	86.91

Table 9: Comparison with state-of-the-art segmentation models on three heterogeneous medical datasets.
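The two metrics in Table 9 can be computed from binary masks as sketched below. For a single pair of masks they are related by $F_{1}=2\,\mathrm{IoU}/(1+\mathrm{IoU})$; averages taken over many images need not satisfy this identity. How the scores are aggregated across images is not stated here, and the function name and the convention for two empty masks are assumptions.

```python
def iou_and_f1(pred_mask, true_mask):
    """Return (IoU, F1) for two flattened binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    fp = sum(1 for p, t in zip(pred_mask, true_mask) if p and not t)
    fn = sum(1 for p, t in zip(pred_mask, true_mask) if not p and t)
    union = tp + fp + fn
    if union == 0:
        # Both masks are empty; treated as a perfect match (assumption).
        return 1.0, 1.0
    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1
```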

5 Ablation Study

In this section, we present the results of an ablation study of the Bottleneck Kolmogorov-Arnold convolutional layers with Gram polynomials. The baseline sequence for convolutional KAN models includes an activation residual summed with a nonlinearity $\tilde{\varphi}$ , followed by a normalization layer (e.g., batch norm, instance norm), and a SiLU activation. The ablation experiments involve excluding one or more elements from this sequence (activation residual, normalization layer, nonlinearity) and replacing the linear bottleneck with a KAN-based bottleneck layer. Experiments were conducted on the MNIST [18], CIFAR-10, CIFAR-100, and Fashion MNIST [41] datasets, and results are shown in Table 10.

The results show that except for MNIST, the KAN-based bottleneck leads to training collapse or significant accuracy degradation, and using activation residual degrades performance in half of the cases. This suggests that activation residual may be redundant in some cases and warrants further investigation. Nevertheless, we propose retaining the convolutional layer scheme with activation residual, as the bottleneck approach can lead to information loss, and activation residual can help in its recovery.

Model	Setup	Accuracy, MNIST	Accuracy, Fashion-MNIST	Accuracy, CIFAR-10	Accuracy, CIFAR-100
8 layers	activation skip, linear bottleneck	0.1135	0.8948	0.7392	0.01
8 layers	activation skip, linear bottleneck, activation	0.1135	0.1	0.1	0.01
8 layers	activation skip, KAN bottleneck, batch norm, activation	0.1135	0.1	0.1	0.01
8 layers	linear bottleneck, batch norm, activation	0.9967	0.9424	0.8999	0.6672
8 layers	activation skip, linear bottleneck, batch norm, activation	0.9969	0.9348	0.8923	0.6489
4 layers	activation skip, linear bottleneck	0.9552	0.815	0.5438	0.2614
4 layers	activation skip, linear bottleneck, activation	0.9858	0.83	0.626	0.2801
4 layers	activation skip, KAN bottleneck, batch norm, activation	0.9949	0.8879	0.7447	0.4417
4 layers	linear bottleneck, batch norm, activation	0.9935	0.9108	0.8021	0.4921
4 layers	activation skip, linear bottleneck, batch norm, activation	0.9941	0.9016	0.7962	0.4951

Table 10: Ablation study results

6 Design Principles

Summarizing the experiments conducted, we propose the following preliminary design principles for Kolmogorov-Arnold convolutional networks. It is important to note that our experiments were not exhaustive, and these principles may be revised with new data.

• We recommend using Gram polynomials for $\tilde{\varphi}(x)$ . Our experiments indicate this choice excels in both quality metrics and the number of trainable parameters.
• For scaling Kolmogorov-Arnold convolution-based models, we suggest using the bottleneck version of the layers, which significantly reduces the number of trainable parameters without substantial loss in performance compared to the non-bottleneck version.
• Increasing model width generally performs better than increasing depth, as shown by our experiments with simple sequential models. The Mixture of Experts versions of the bottleneck convolution effectively scales model width without a significant increase in inference and training costs.
• Preliminary findings suggest that DenseNet-like architectures could serve as a strong foundation for constructing very deep Kolmogorov-Arnold convolutional networks.
• Our experiments demonstrate that Self KAGNtention layers can enhance the performance of Kolmogorov-Arnold convolutional models.
• Kolmogorov-Arnold convolutions perform exceptionally well in segmentation tasks, with $U^{2}$-Net recommended as a starting architecture for further research.
• As shown in Section 4.2, $L_{1}$ and $L_{2}$ activation regularization, and Noise Injection layers before the Kolmogorov-Arnold convolutional layer are effective regularization techniques.

7 Conclusion

This paper explores the integration of Kolmogorov-Arnold Networks (KANs) into convolutional neural network architectures, presenting novel approaches and modifications to enhance their performance and efficiency in computer vision tasks. Our work introduces Bottleneck Convolutional Kolmogorov-Arnold layers, a parameter-efficient design that reduces memory requirements and mitigates overfitting issues. Additionally, we propose a parameter-efficient fine-tuning algorithm that significantly decreases the number of trainable parameters needed for adapting pre-trained models to new tasks.

Through extensive empirical evaluations on various datasets, including MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, HAM10000, BUSI, GlaS, and CVC-ClinicDB, we demonstrate that KAN-based convolutional models can achieve state-of-the-art results in both classification and segmentation tasks. Our experiments highlight the effectiveness of Gram polynomials as the basis function for KANs, the advantages of scaling model width over depth, and the potential of DenseNet-like architectures for very deep networks.

We further show that incorporating Self KAGNtention layers enhances model performance, particularly in complex tasks, and provide design principles for constructing successful KAN convolutional models. Our proposed models not only outperform traditional convolutional networks but also offer a promising direction for future research in optimizing neural network architectures for computer vision applications.

Overall, our findings emphasize the potential of Kolmogorov-Arnold Networks in advancing the capabilities of convolutional neural networks, paving the way for more efficient and effective deep learning models. Future work will focus on refining these approaches and exploring their applications in other domains, as well as investigating additional regularization techniques and optimization strategies to further enhance model performance.

References

[1] Yann LeCun, Koray Kavukcuoglu, and Clement Farabet. Convolutional networks and applications in vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 253–256, 2010.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
[3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[4] Anish Shah, Eashan Kadam, Hena Shah, Sameer Shinde, and Sandip Shingade. Deep residual networks with exponential linear unit. In Proceedings of the Third International Symposium on Computer Vision and the Internet, VisionNet’16, page 59–65, New York, NY, USA, 2016. Association for Computing Machinery.
[5] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
[6] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
[7] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark. Kan: Kolmogorov-arnold networks, 2024.
[8] Subhransu S. Bhattacharjee. Torchkan: Simplified kan model with variations. https://github.com/1ssb/torchkan/, 2024.
[9] Alexander Dylan Bodner, Antonio Santiago Tepsich, Jack Natan Spolski, and Santiago Pourteau. Convolutional kolmogorov-arnold networks, 2024.
[10] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pages 3–11. Springer, Cham, 2018.
[11] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation, 2021.
[12] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition, 106:107404, 2020.
[13] Chenxin Li, Xinyu Liu, Wuyang Li, Cheng Wang, Hengyu Liu, and Yixuan Yuan. U-kan makes strong backbone for medical image segmentation and generation, 2024.
[14] Tawan. Gram: Kan meets gram polynomials. https://github.com/Khochawongwat/GRAMKAN, 2024.
[15] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.
[16] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 22–31, 2021.
[17] Jianwei Yang, Chunyuan Li, Xiyang Dai, and Jianfeng Gao. Focal modulation networks. Advances in Neural Information Processing Systems, 35:4203–4217, 2022.
[18] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[19] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[20] Ya Le and Xuan S. Yang. Tiny imagenet visual recognition challenge. 2015.
[21] Zoheb Abai and Nishad Rajmalwar. Densenet models for tiny imagenet classification. ArXiv, abs/1904.10429, 2019.
[22] Yuhang Li, Yufei Guo, Shanghang Zhang, Shikuang Deng, Yongqing Hai, and Shi Gu. Differentiable spike: Rethinking gradient-descent for training spiking neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 23426–23439. Curran Associates, Inc., 2021.
[23] Mingqing Xiao, Qingyan Meng, Zongpeng Zhang, Di He, and Zhouchen Lin. Online training through time for spiking neural networks. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
[24] Pranav Jeevan, Kavitha Viswanathan, Anandu A S, and Amit Sethi. Wavemix: A resource-efficient neural network for image analysis, 2024.
[25] Yufei Guo, Yuanpei Chen, Liwen Zhang, Xiaode Liu, Yinglei Wang, Xuhui Huang, and Zhe Ma. Im-loss: Information maximization loss for spiking neural networks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 156–166. Curran Associates, Inc., 2022.
[26] Riad Ibadulla, Thomas M. Chen, and Constantino Carlos Reyes-Aldasoro. Fatnet: High-resolution kernels for classification using fully convolutional optical neural networks. AI, 4(2):361–374, 2023.
[27] Zicheng Liu, Siyuan Li, Di Wu, Zihan Liu, Zhiyuan Chen, Lirong Wu, and Stan Z. Li. Automix: Unveiling the power of mixup for stronger classifiers. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 441–458, Cham, 2022. Springer Nature Switzerland.
[28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.
[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
[30] K Simonyan and A Zisserman. Very deep convolutional networks for large-scale image recognition. pages 1–14. Computational and Biological Learning Society, 2015.
[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[32] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1), August 2018.
[33] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):318–327, 2020.
[34] Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in brief, 28:104863, 2020.
[35] Jeya Maria Jose Valanarasu, Poojan Oza, Ilker Hacihaliloglu, and Vishal M Patel. Medical transformer: Gated axial-attention for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 36–46. Springer, 2021.
[36] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized medical imaging and graphics, 43:99–111, 2015.
[37] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
[38] Jeya Maria Jose Valanarasu and Vishal M Patel. Unext: Mlp-based rapid medical image segmentation network. In International conference on medical image computing and computer-assisted intervention, pages 23–33. Springer, 2022.
[39] Yutong Liu, Haijiang Zhu, Mengting Liu, Huaiyuan Yu, Zihan Chen, and Jie Gao. Rolling-unet: Revitalizing mlp’s ability to efficiently extract long-distance dependencies for medical image segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 3819–3827, 2024.
[40] Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
[41] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.

Val. Accuracy	Model	Parameters
74.87	8 layers	203.59M
72.48	12 layers	389.39M
67.28	16 layers	477.83M