Stony Brook University, Stony Brook NY 11794, USA
email: [email protected]

MeGA: Merging Multiple Independently Trained Neural Networks Based on Genetic Algorithm

Daniel Yun

Abstract

In this paper, we introduce MeGA, a novel method for merging the weights of multiple pre-trained neural networks using a genetic algorithm. Traditional techniques, such as weight averaging and ensemble methods, often fail to fully harness the capabilities of pre-trained networks. Our approach leverages a genetic algorithm with tournament selection, crossover, and mutation to optimize weight combinations, creating a more effective fusion. This technique allows the merged model to inherit advantageous features from both parent models, resulting in enhanced accuracy and robustness. Through experiments on the CIFAR-10 dataset, we demonstrate that our genetic algorithm-based weight merging method improves test accuracy compared to individual models and conventional methods. This approach provides a scalable solution for integrating multiple pre-trained networks across various deep learning applications. The code is available on GitHub.

Keywords: Neural Networks · Deep Learning · Genetic Algorithm · Evolutionary Algorithms

1 Introduction

In recent years, deep learning has achieved state-of-the-art performance across various tasks, such as image classification and natural language processing, largely due to the use of pre-trained neural networks [8, 21, 27]. These networks can be fine-tuned for specific tasks, saving computational resources and time. However, effectively combining multiple pre-trained models of the same architecture to harness their collective strengths and mitigate individual weaknesses remains a significant challenge.

Merging weights from different pre-trained models with identical architectures is crucial because individual models often learn complementary features from the data [11, 3, 23]. Combining these models can create a more robust and accurate system, leveraging the strengths of each model and leading to improved performance. Additionally, model fusion can help achieve better generalization by averaging out individual biases and reducing overfitting [3, 20, 31].

Training models in parallel rather than sequentially offers efficiency and practicality. Multiple models can learn diverse patterns quickly, and merging their weights can achieve superior performance faster [14, 2, 10]. This approach is particularly advantageous in distributed learning environments, facilitating scalable and efficient training across multiple GPUs or devices [2, 22].

The primary challenge in weight merging is finding the optimal combination that maximizes performance without further training or altering the model’s architecture. Simple averaging fails to account for the intricate dependencies within the neural network [16, 5, 30]. To address this, we propose a novel approach using a genetic algorithm to optimize weight merging. Genetic algorithms efficiently search large, complex spaces by iteratively selecting, combining, and mutating weights [25, 13, 7].

In this paper, we introduce our genetic algorithm-based method, MeGA, for merging the weights of multiple trained CNN models. Experiments on the CIFAR-10 dataset [18] demonstrate improvements in test accuracy compared to individual models and traditional techniques. Rather than focusing on comparative analysis, this research emphasizes the methodology of effectively merging models to leverage their collective strengths. Our results highlight the potential of genetic algorithms in optimizing neural network weight fusion, providing a scalable solution for integrating multiple pre-trained models in various deep learning applications. Additionally, we demonstrate that it is possible to merge the weights of neural networks that have been initialized differently and trained independently, further underscoring the flexibility and robustness of our proposed method.

2 Related Works

The fusion of neural network models has been an active area of research due to its potential to enhance model performance by leveraging the strengths of individual models. Several methods have been proposed to address the challenges associated with model merging, each with unique approaches and varying degrees of success.

Weight Merge. One of the earliest and simplest methods for model fusion is weight averaging, where the weights of two or more models are averaged to create a new model. This approach, while straightforward, often fails to consider the complex interactions between different layers of the networks. Goodfellow et al. introduced an approach that averages the weights of neural networks trained with different initializations, showing marginal improvements in performance [9]. Similarly, ensemble methods, where the predictions of multiple models are combined, have been extensively studied and are known to improve model robustness and accuracy [19]. However, these methods do not directly merge the models’ internal representations, potentially limiting their effectiveness. Izmailov et al. proposed Stochastic Weight Averaging (SWA), which improves generalization by averaging weights along the trajectory of SGD with a cyclical or constant learning rate [16]. Welling and Teh applied Stochastic Gradient Langevin Dynamics (SGLD) to sample from the posterior distribution of model weights, thereby integrating Bayesian principles into model merging [31]. This method helps in capturing the model uncertainty but is computationally intensive and complex to implement. Huang et al. introduced the concept of Snapshot Ensembles, where a single neural network is trained with a cyclical learning rate schedule, and multiple snapshots of the model are taken at different local minima [14]. Shen and Kong [28] demonstrated the use of genetic algorithms to enhance prediction accuracy by optimizing weights in neural network ensembles.

Genetic Algorithms in Neural Networks. Genetic algorithms have been applied in various aspects of neural network optimization. Real et al. utilized evolutionary algorithms for neural architecture search, demonstrating the potential of genetic approaches in discovering optimal network structures [26]. Similarly, Xie and Yuille used genetic algorithms for model pruning, showing that evolutionary techniques can effectively optimize neural network parameters [32]. Stanley and Miikkulainen introduced NeuroEvolution of Augmenting Topologies (NEAT), which evolves both the architecture and weights of neural networks, leading to improved performance on complex tasks [29]. Another notable application is by Fernando et al., who proposed PathNet, a method that uses evolutionary strategies to discover pathways through a neural network, enabling efficient transfer learning [4]. Finally, Loshchilov and Hutter applied a genetic algorithm for hyperparameter optimization, showing that evolutionary strategies can outperform traditional grid and random search methods [24].

Building on the success of these methods, our approach leverages genetic algorithms to optimize the weight merging process of pre-trained CNN models. By iteratively selecting, combining, and mutating weights, our method systematically explores the weight space to discover an optimal or near-optimal set of weights. This novel approach not only enhances model performance but also facilitates efficient parallel training and distributed learning across multiple devices, addressing scalability and robustness issues inherent in traditional methods.

3 Methodology

In this section, we describe the methodology used to merge the weights of two pre-trained neural network models using a genetic algorithm. We call our method MeGA. Genetic algorithms provide a robust optimization framework by mimicking natural selection, making them suitable for complex tasks like weight merging [25].

Our approach aligns with the lottery ticket hypothesis, which suggests that neural networks contain critical weights (winning tickets) necessary for maintaining and enhancing performance [6]. In our MeGA algorithm, child models inherit and preserve these beneficial weight configurations from their parent models. By iteratively selecting the best-performing individuals as parents, the algorithm ensures that critical weights are retained and combined, allowing child models to evolve and enhance overall performance. This method enables us to effectively navigate the weight space, merging multiple pre-trained models into a single, superior model that leverages the strengths of each original network.

3.1 Genetic Algorithm Framework

The process of merging two neural networks using a genetic algorithm is illustrated in Figure 1. The initial population is created by combining the element-wise weights from two pre-trained networks. Each individual in the population represents a potential solution with a unique combination of weights. The genetic algorithm iteratively selects the best-performing individuals as parents based on their evaluation, performs crossover to generate new children, and applies mutation to introduce variability. This process continues over several generations, resulting in a single merged network that combines the strengths of both original networks.

Figure 1: MeGA: The process of merging two neural networks using a genetic algorithm.

Genetic algorithms operate on a population of potential solutions, iteratively improving them through the processes of selection, crossover, and mutation [25]. Each individual in the population represents a candidate solution, and its fitness is evaluated based on a predefined objective function. The algorithm proceeds by selecting individuals with higher fitness to produce offspring, combining their traits through crossover, and introducing random variations through mutation. Over successive generations, the population evolves towards better solutions [25].

Algorithm 1 Genetic Algorithm for Weight Merging (Element-wise)

Require: $N$: Population size; $G$: Number of generations; $K$: Number of parents for tournament selection; $p_{\text{mut}}$: Mutation probability; $\sigma$: Standard deviation for Gaussian noise in mutation; $\boldsymbol{\theta}_{1}$: Weights of the first pre-trained model; $\boldsymbol{\theta}_{2}$: Weights of the second pre-trained model
Ensure: Optimal weights $\boldsymbol{\theta}^{*}$
1: Initialization:
2: Initialize population $\mathcal{P}$ with $N$ individuals where each individual $\boldsymbol{\theta}_{i}$ is a linear combination of $\boldsymbol{\theta}_{1}$ and $\boldsymbol{\theta}_{2}$
3: for each individual $i$ in $\mathcal{P}$ do
4:   $\alpha\sim\text{Uniform}(0,1)$
5:   $\theta_{i,j,k}=\alpha\theta_{1,j,k}+(1-\alpha)\theta_{2,j,k}\quad\forall j,k$
6: end for
7: Genetic Algorithm Execution:
8: for generation = 1 to $G$ do
9:   Fitness Evaluation: Evaluate fitness $F(\boldsymbol{\theta}_{i})$ for each individual in $\mathcal{P}$
10:   for each individual $i$ in $\mathcal{P}$ do
11:     $F(\boldsymbol{\theta}_{i})=\text{Accuracy}(f(\mathbf{X}_{\text{val}};\boldsymbol{\theta}_{i}),\mathbf{y}_{\text{val}})$
12:   end for
13:   Selection: Select $K$ parents using tournament selection
14:   for each selection round from 1 to $K$ do
15:     Randomly select $t$ individuals to form tournament set $\mathcal{T}$
16:     $\mathcal{T}=\{\boldsymbol{\theta}_{i_{1}},\boldsymbol{\theta}_{i_{2}},\ldots,\boldsymbol{\theta}_{i_{t}}\}$
17:     Evaluate the fitness $F(\boldsymbol{\theta})$ of each individual in the tournament set
18:     Select the individual with the highest fitness as a parent: $\boldsymbol{\theta}_{\text{parent}}=\arg\max_{\boldsymbol{\theta}\in\mathcal{T}}F(\boldsymbol{\theta})$
19:     Add $\boldsymbol{\theta}_{\text{parent}}$ to the selected parents set
20:   end for
21:   Crossover: Generate offspring through element-wise crossover of selected parents
22:   for each pair of parents $(\boldsymbol{\theta}_{\text{parent1}},\boldsymbol{\theta}_{\text{parent2}})$ in selected parents do
23:     $\alpha\sim\text{Uniform}(0,1)$
24:     $\theta_{\text{child},j,k}=\alpha\theta_{\text{parent1},j,k}+(1-\alpha)\theta_{\text{parent2},j,k}\quad\forall j,k$
25:     Add $\boldsymbol{\theta}_{\text{child}}$ to the offspring set
26:   end for
27:   Mutation: Apply mutation to offspring
28:   for each individual $\boldsymbol{\theta}_{\text{child}}$ in offspring do
29:     for each weight element $\theta_{j,k}$ in $\boldsymbol{\theta}_{\text{child}}$ do
30:       if random number $<p_{\text{mut}}$ then
31:         $\theta_{j,k}=\theta_{j,k}+\epsilon$, where $\epsilon\sim\mathcal{N}(0,\sigma^{2})$
32:       end if
33:     end for
34:   end for
35:   Population Update: Form new population $\mathcal{P}^{\prime}$ with the best individual from $\mathcal{P}$ (elitism) and newly generated offspring
36:   Keep the best individual from $\mathcal{P}$ based on fitness
37:   $\mathcal{P}\leftarrow\mathcal{P}^{\prime}$
38:   Update the best individual $\boldsymbol{\theta}^{*}$ if a better fitness is found in the current generation
39: end for
40: Output: Best individual $\boldsymbol{\theta}^{*}$ with the highest fitness

3.1.1 Preliminaries

Let $\mathbf{X}\in\mathbb{R}^{n\times d}$ be the training data, where $n$ is the number of samples and $d$ is the dimensionality of each sample. The corresponding labels are denoted by $\mathbf{y}\in\mathbb{R}^{n}$. We define two pre-trained neural networks, $f_{1}(\mathbf{X};\boldsymbol{\theta}_{1})$ and $f_{2}(\mathbf{X};\boldsymbol{\theta}_{2})$, where $\boldsymbol{\theta}_{1}$ and $\boldsymbol{\theta}_{2}$ represent the weights of the respective models. Our objective is to merge these weights to create a new model $f(\mathbf{X};\boldsymbol{\theta})$ that achieves superior performance.

3.1.2 Algorithm Execution

The genetic algorithm iterates through $G$ generations, performing selection, crossover, and mutation at each step. The best individual $\boldsymbol{\theta}^{*}$ with the highest fitness across all generations is selected as the final set of weights for the fused model. The overall procedure is summarized as follows:

1. Initialization: Create an initial population $\mathcal{P}$ of $N$ individuals.
2. Fitness Evaluation: Evaluate the fitness $F(\boldsymbol{\theta})$ for each individual in $\mathcal{P}$.
3. Selection: Select $K$ parents from $\mathcal{P}$ using tournament selection.
4. Crossover: Generate offspring through crossover of selected parents.
5. Mutation: Apply mutation to the offspring.
6. Population Update: Form the new population $\mathcal{P}^{\prime}$ with the best individual from $\mathcal{P}$ (elitism) and the newly generated offspring.
7. Iteration: Repeat steps 2-6 for $G$ generations.
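The seven steps can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: weights are flattened into plain Python lists, `toy_fitness` is a hypothetical stand-in for validation accuracy, the tournament size `t=3` and noise scale `sigma=0.01` are assumed values, and parent pairs are drawn at random because the pairing rule is not specified in the text.

```python
import random

def blend(a, b, coeff):
    """Element-wise coeff * a + (1 - coeff) * b."""
    return [coeff * x + (1.0 - coeff) * y for x, y in zip(a, b)]

def mega_merge(theta1, theta2, fitness, n=20, g=20, k=4, t=3,
               p_mut=0.02, sigma=0.01, seed=0):
    """Hypothetical helper following steps 1-7 on flattened weight lists."""
    rng = random.Random(seed)
    # Step 1: initial population of random blends of the two models.
    population = [blend(theta1, theta2, rng.random()) for _ in range(n)]
    best, best_fit = None, float("-inf")
    for _ in range(g):  # Step 7: repeat for g generations.
        # Step 2: fitness evaluation.
        fits = [fitness(ind) for ind in population]
        elite = max(range(n), key=fits.__getitem__)
        if fits[elite] > best_fit:
            best, best_fit = population[elite], fits[elite]
        # Step 3: tournament selection of k parents.
        parents = []
        for _ in range(k):
            contenders = rng.sample(range(n), t)
            parents.append(population[max(contenders, key=fits.__getitem__)])
        # Steps 4-5: crossover, then mutation, until n - 1 offspring exist.
        offspring = []
        while len(offspring) < n - 1:
            pa, pb = rng.sample(parents, 2)
            child = blend(pa, pb, rng.random())
            child = [w + rng.gauss(0.0, sigma) if rng.random() < p_mut else w
                     for w in child]
            offspring.append(child)
        # Step 6: the elite individual plus the offspring form the next population.
        population = [population[elite]] + offspring
    return best, best_fit

def toy_fitness(weights, target=(0.3, -0.4, 1.1)):
    """Stand-in objective: negative squared distance to a fixed target."""
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

theta1 = [1.0, -2.0, 0.5]
theta2 = [0.0, 2.0, 1.5]
best, best_fit = mega_merge(theta1, theta2, toy_fitness)
```

Because the elite individual is always carried over, the tracked best fitness never decreases from one generation to the next in this sketch.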

3.1.3 Initialization

The initial population $\mathcal{P}$ consists of $N$ individuals, where each individual represents a potential solution in the form of a set of weights $\boldsymbol{\theta}$. Each individual is initialized to ensure diversity, which is crucial for the effectiveness of the genetic algorithm.

The weights of each individual $\boldsymbol{\theta}_{i}$ are initialized as an element-wise linear combination of the weights from two pre-trained models, $\boldsymbol{\theta}_{1}$ and $\boldsymbol{\theta}_{2}$:

$$\theta_{i,j,k}=\alpha\theta_{1,j,k}+(1-\alpha)\theta_{2,j,k},\quad\alpha\sim\text{Uniform}(0,1)$$

Here, $\alpha$ is a random scalar drawn from a uniform distribution between 0 and 1. This ensures that each individual’s weights are a unique blend of the two parent models, with the combination applied element-wise to preserve the detailed characteristics of both models.
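As a small illustration of this initialization, the sketch below builds a population from two toy weight vectors. It assumes the weights are flattened into plain Python lists; `init_population` is a hypothetical helper name and the numeric values are made up for the example.

```python
import random

def init_population(theta1, theta2, n, rng):
    """Build n individuals, each equal to alpha*theta1 + (1 - alpha)*theta2.

    One scalar alpha ~ Uniform(0, 1) is drawn per individual and applied
    to every weight element.
    """
    population = []
    for _ in range(n):
        alpha = rng.random()  # scalar mixing coefficient in [0, 1)
        population.append(
            [alpha * a + (1.0 - alpha) * b for a, b in zip(theta1, theta2)]
        )
    return population

# Toy flattened weight vectors (hypothetical values, not from the experiments).
theta1 = [1.0, -2.0, 0.5]
theta2 = [0.0, 2.0, 1.5]
population = init_population(theta1, theta2, n=20, rng=random.Random(0))
```

Each individual produced this way lies on the line segment between the two weight vectors, so diversity in the initial population comes only from the draw of $\alpha$.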

3.1.4 Fitness Evaluation

The fitness of each individual in the population is assessed based on the validation accuracy on a hold-out validation set $(\mathbf{X}_{\text{val}},\mathbf{y}_{\text{val}})$. For an individual with weights $\boldsymbol{\theta}_{i}$, the fitness function $F(\boldsymbol{\theta}_{i})$ is defined as follows:

$$F(\boldsymbol{\theta}_{i})=\text{Accuracy}(f(\mathbf{X}_{\text{val}};\boldsymbol{\theta}_{i}),\mathbf{y}_{\text{val}})$$

This function measures how accurately the model with weights $\boldsymbol{\theta}_{i}$ predicts the labels of the validation set. The model $f(\mathbf{X};\boldsymbol{\theta}_{i})$ is evaluated, and the accuracy is computed as the proportion of correctly classified samples in $\mathbf{X}_{\text{val}}$.

Mathematically, the accuracy is given by:

$$\text{Accuracy}(f(\mathbf{X};\boldsymbol{\theta}),\mathbf{y})=\frac{1}{n_{\text{val}}}\sum_{i=1}^{n_{\text{val}}}\mathbb{I}\left(\arg\max_{j}f_{j}(\mathbf{X}_{i};\boldsymbol{\theta})=y_{i}\right)$$

where $n_{\text{val}}$ is the number of validation samples, $f_{j}(\mathbf{X}_{i};\boldsymbol{\theta})$ is the predicted probability for class $j$ for the $i$-th validation sample, and $\mathbb{I}$ is the indicator function that equals 1 if the predicted class matches the true label $y_{i}$, and 0 otherwise.
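The accuracy formula can be checked on a handful of samples. The sketch below assumes the model outputs are already available as rows of class scores; the scores and labels are hypothetical.

```python
def accuracy(scores, labels):
    """Fraction of samples whose arg-max class score equals the true label."""
    correct = 0
    for row, y in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # arg max over j
        correct += int(predicted == y)  # indicator function
    return correct / len(labels)

# Hypothetical class scores for 4 validation samples and 3 classes.
val_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
val_labels = [0, 1, 2, 1]
fitness = accuracy(val_scores, val_labels)  # 3 of the 4 predictions match
```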

3.1.5 Selection

Tournament selection is employed to choose $K$ parents for the next generation. This method ensures that individuals with higher fitness are more likely to be selected, promoting desirable traits in subsequent generations. The selection process involves several steps.

First, we randomly select $t$ individuals from the current population $\mathcal{P}$ to form a tournament set $\mathcal{T}$:

$$\mathcal{T}=\{\boldsymbol{\theta}_{i_{1}},\boldsymbol{\theta}_{i_{2}},\ldots,\boldsymbol{\theta}_{i_{t}}\},\quad\mathcal{T}\subseteq\mathcal{P}$$

Next, the fitness $F(\boldsymbol{\theta})$ of each individual $\boldsymbol{\theta}$ in the tournament set $\mathcal{T}$ is evaluated using the previously defined fitness function. After evaluating the fitness of all individuals in the tournament set, the individual with the highest fitness is selected as a parent:

$$\boldsymbol{\theta}_{\text{parent}}=\arg\max_{\boldsymbol{\theta}\in\mathcal{T}}F(\boldsymbol{\theta})$$

This process is repeated until $K$ parents are selected for crossover.
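A minimal sketch of this selection step follows. It assumes each tournament draws its $t$ contenders without replacement and that fitness values have already been computed; the tournament size `t=3`, the helper name `tournament_select`, and the fitness values are illustrative assumptions.

```python
import random

def tournament_select(population, fitnesses, k, t, rng):
    """Pick k parents; each is the fittest of t randomly drawn individuals."""
    parents = []
    for _ in range(k):
        contenders = rng.sample(range(len(population)), t)  # distinct indices
        winner = max(contenders, key=lambda i: fitnesses[i])
        parents.append(population[winner])
    return parents

# Toy population of labelled individuals with hypothetical fitness values.
population = ["a", "b", "c", "d", "e"]
fitnesses = [0.71, 0.80, 0.76, 0.82, 0.79]
parents = tournament_select(population, fitnesses, k=4, t=3, rng=random.Random(0))
```

A larger $t$ raises the selection pressure: with $t$ equal to the population size, every tournament returns the single best individual, whereas $t=1$ reduces to uniform random selection.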

3.1.6 Crossover

Crossover is performed on pairs of selected parents to generate offspring. This process combines the genetic material (weights) of two parents to produce a new individual (offspring), thereby promoting genetic diversity. Given two parents $\boldsymbol{\theta}_{a}$ and $\boldsymbol{\theta}_{b}$, an offspring $\boldsymbol{\theta}_{\text{child}}$ is created by taking a weighted combination of the parents’ weights. The crossover is performed element-wise for each weight:

$$\theta_{\text{child},j,k}=\beta\theta_{a,j,k}+(1-\beta)\theta_{b,j,k},\quad\beta\sim\text{Uniform}(0,1)$$
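The crossover equation amounts to a single-coefficient blend of two weight vectors. The sketch below assumes flattened weight lists and uses made-up parent values; `crossover` is a hypothetical helper name.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Blend two parents element-wise with one scalar beta ~ Uniform(0, 1)."""
    beta = rng.random()
    return [beta * a + (1.0 - beta) * b for a, b in zip(parent_a, parent_b)]

# Hypothetical parent weight vectors.
parent_a = [0.2, -1.0, 3.0]
parent_b = [0.6, 1.0, 3.0]
child = crossover(parent_a, parent_b, random.Random(0))
```

Under the equations as written, both initialization and crossover are convex combinations with a scalar coefficient, so every individual produced without mutation stays on the line segment between $\boldsymbol{\theta}_{1}$ and $\boldsymbol{\theta}_{2}$. Mutation is then the only operator that moves the search off that segment.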

3.1.7 Mutation

Mutation introduces variability into the population by perturbing the weights of the offspring. For each weight $\theta_{j,k}$ in the offspring $\boldsymbol{\theta}_{\text{child}}$, mutation is applied with a probability $p_{\text{mut}}$:

$$\theta_{\text{child},j,k}\leftarrow\theta_{\text{child},j,k}+\epsilon,\quad\epsilon\sim\mathcal{N}(0,\sigma^{2})$$

Here, $\epsilon$ is a random variable drawn from a normal distribution with mean 0 and variance $\sigma^{2}$. The mutation rate $p_{\text{mut}}$ controls the likelihood of each weight being mutated. This mutation process ensures that the population maintains genetic diversity by introducing new genetic material.
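A minimal sketch of the mutation operator is given below. The noise scale `sigma=0.01` is an assumed value chosen for illustration, and `mutate` is a hypothetical helper name; weights are again flattened into a plain list.

```python
import random

def mutate(child, p_mut, sigma, rng):
    """Add N(0, sigma^2) noise to each weight independently with probability p_mut."""
    return [
        w + rng.gauss(0.0, sigma) if rng.random() < p_mut else w
        for w in child
    ]

# With p_mut = 0.02, roughly 2% of the weights are expected to be perturbed.
child = [0.0] * 10000
mutated = mutate(child, p_mut=0.02, sigma=0.01, rng=random.Random(0))
num_changed = sum(1 for w in mutated if w != 0.0)
```

Because each weight is tested independently, the number of perturbed weights grows linearly with model size for a fixed $p_{\text{mut}}$.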

4 Experimental Results

4.1 Experimental Settings

We used the CIFAR-10 dataset [18]. The CIFAR-10 training set was split into 45,000 training images and 5,000 validation images. The validation set was used for model validation and during the merging process. We used the following CNN models: ResNet [12], Xception [1], and DenseNet [15], all trained from scratch (without pre-trained weights) and without data augmentation. These models were trained with a batch size of 256 for 50 epochs using the Adam optimizer [17] with a learning rate of 0.01.

Each model followed the same architecture and hyperparameters to ensure consistency and a fair comparison. After training, we applied our MeGA approach to merge the weights of these models into a single set of weights. The performance of the merged model was then evaluated and compared against the individual models. For all experiments, we used the following MeGA hyperparameters: a population size of 20, 20 generations, 4 parents per generation, and a mutation rate of 0.02.
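For reference, the reported MeGA hyperparameters can be collected in a small configuration object. The class and field names below are hypothetical; the standard deviation $\sigma$ and the tournament size $t$ are not reported in this section and are therefore left out. The helper method gives a rough upper bound on the number of validation passes for one merge, assuming every individual is evaluated once per generation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MeGAConfig:
    population_size: int = 20    # N
    generations: int = 20        # G
    num_parents: int = 4         # K
    mutation_rate: float = 0.02  # p_mut

    def fitness_evaluations(self) -> int:
        """Upper bound on validation passes: N individuals in each of G generations."""
        return self.population_size * self.generations

config = MeGAConfig()
evaluations = config.fitness_evaluations()
```

Under that assumption, one pairwise merge costs at most 400 forward passes over the 5,000-image validation set, with no gradient computation.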

4.2 CIFAR-10 Results

The results of the image classification experiments on the CIFAR-10 dataset [18] demonstrate the effectiveness of our MeGA.

Table 1: Performance Comparison of Individual and Merged CNN Models based on MeGA with CIFAR-10 Dataset

| Dataset (# of classes) | Model (# of params.) | Baseline (random seed) | Baseline test acc. (single model) | Merging method | Final test acc. (merged model) |
|---|---|---|---|---|---|
| CIFAR-10 [18] (10) | ResNet 56 (0.86M) [12] | Model 1 (56) | 0.809 | Weight Average | 0.010 |
| | | Model 2 (57) | 0.793 | MeGA (Ours) | 0.822 |
| | ResNet 110 (1.78M) [12] | Model 1 (56) | 0.801 | Weight Average | 0.010 |
| | | Model 2 (57) | 0.814 | MeGA (Ours) | 0.816 |
| | ResNet 152 (3.51M) [12] | Model 1 (56) | 0.783 | Weight Average | 0.010 |
| | | Model 2 (57) | 0.781 | MeGA (Ours) | 0.819 |
| | Xception (22.96M) [1] | Model 1 (56) | 0.716 | Weight Average | 0.010 |
| | | Model 2 (57) | 0.730 | MeGA (Ours) | 0.754 |
| | DenseNet 121 (7.04M) [15] | Model 1 (56) | 0.719 | Weight Average | 0.010 |
| | | Model 2 (57) | 0.728 | MeGA (Ours) | 0.742 |
| | DenseNet 169 (12.65M) [15] | Model 1 (56) | 0.735 | Weight Average | 0.010 |
| | | Model 2 (57) | 0.731 | MeGA (Ours) | 0.753 |

The MeGA approach improved test accuracy over the individual models for every architecture in Table 1. For example, relative to the mean baseline accuracy of its two parent models, the ResNet 56 model’s accuracy improved from 0.801 to 0.822, and the merged ResNet 110 model reached 0.816.

The Xception model’s accuracy increased from a mean baseline of 0.723 to 0.754, demonstrating the effectiveness of MeGA in optimizing complex architectures. Similarly, the DenseNet 121 model’s accuracy rose from 0.724 to 0.742, and the DenseNet 169 model improved from 0.733 to 0.753. Overall, the merged models consistently outperformed both of their parent models, highlighting the potential of MeGA for enhancing neural network performance through genetic algorithm-based weight merging.
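The single-model figures quoted in the two paragraphs above (0.801, 0.723, 0.724, 0.733) are consistent with the mean of the two baseline accuracies in Table 1. The short sketch below recomputes those means and also reports the gain over the better of the two parents, which is the stricter comparison. The dictionary layout and variable names are illustrative; the numbers are copied from Table 1.

```python
# (baseline model 1, baseline model 2, MeGA merged) test accuracies from Table 1.
table1 = {
    "ResNet 56": (0.809, 0.793, 0.822),
    "ResNet 110": (0.801, 0.814, 0.816),
    "ResNet 152": (0.783, 0.781, 0.819),
    "Xception": (0.716, 0.730, 0.754),
    "DenseNet 121": (0.719, 0.728, 0.742),
    "DenseNet 169": (0.735, 0.731, 0.753),
}

gains = {}
for name, (m1, m2, merged) in table1.items():
    mean_base = (m1 + m2) / 2
    gains[name] = {
        "mean_baseline": mean_base,
        "over_mean": merged - mean_base,       # gain relative to the average parent
        "over_best": merged - max(m1, m2),     # gain relative to the stronger parent
    }
```

Measured against the stronger parent, the gains in Table 1 range from about 0.002 (ResNet 110) to about 0.036 (ResNet 152).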

Figure 2 shows the progression of the best fitness values over 20 generations for the ResNet-56 and DenseNet 121 models. For the ResNet-56 model, the best fitness improved steadily from 0.8042 in the first generation to 0.8224 in the twentieth generation. Similarly, the DenseNet 121 model saw an improvement in best fitness from 0.7290 to 0.7427 over the same period. These plots illustrate the genetic algorithm’s capability to effectively navigate the weight space and optimize combinations to enhance model performance.

Weight averaging performed poorly because the networks were initialized differently and trained independently. Their learned weights therefore differ substantially, so a simple element-wise average does not yield a working model. This highlights the importance of considering the initialization and training processes when merging models.
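One well-known source of such discrepancies is the permutation symmetry of hidden units: two networks can compute the same function while storing their units in a different order. The toy sketch below is an illustration only, not one of the experiments reported here. It builds two tiny ReLU networks (hypothetical weights) that are identical up to a swap of their two hidden units and shows that their element-wise average no longer computes the same function.

```python
def forward(w1, w2, x):
    """Two-layer network: w2 . relu(w1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))

def average(a, b):
    """Element-wise mean of two equally shaped (possibly nested) weight lists."""
    if isinstance(a, list):
        return [average(x, y) for x, y in zip(a, b)]
    return 0.5 * (a + b)

# Model B is model A with its two hidden units swapped.
w1_a, w2_a = [[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0]
w1_b, w2_b = [[0.0, 1.0], [1.0, 0.0]], [-1.0, 1.0]

x = [2.0, 0.5]
out_a = forward(w1_a, w2_a, x)
out_b = forward(w1_b, w2_b, x)
out_avg = forward(average(w1_a, w1_b), average(w2_a, w2_b), x)
```

Both models return the same output for `x`, while the averaged weights have an all-zero output layer and return 0 for every input.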

4.3 Extended Experiments for Multi-Models

In this section, we illustrate the application of our genetic algorithm-based weight merging method to combine multiple neural network models as depicted in Figure 3. The hierarchical merging process starts by training eight models on the CIFAR-10 dataset [12, 18]. These models are paired and merged using a genetic algorithm, resulting in four intermediate models. These intermediate models are further merged into two higher-level models, which are finally merged into a single, robust model.
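The hierarchical scheme is a pairwise reduction over the list of models. The sketch below assumes adjacent models are paired at each level and that the number of models is a power of two; `merge_pair` is a placeholder for one MeGA run on two models, replaced here by a stand-in that only records which models were combined.

```python
def hierarchical_merge(models, merge_pair):
    """Merge models pairwise, level by level, until one model remains."""
    level = list(models)
    while len(level) > 1:
        if len(level) % 2 != 0:
            raise ValueError("this sketch assumes a power-of-two model count")
        # Pair adjacent models and merge each pair into one model.
        level = [merge_pair(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

merge_calls = []

def record_merge(a, b):
    """Stand-in for a pairwise merge: logs the call and returns the pair."""
    merge_calls.append((a, b))
    return (a, b)

models = [f"m{i}" for i in range(1, 9)]          # eight trained models
merged = hierarchical_merge(models, record_merge)  # 8 -> 4 -> 2 -> 1
```

For eight models this performs seven pairwise merges across three levels.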

Table 2: Performance Comparison of 8 Individual and Merged ResNet Models based on MeGA with CIFAR-10 Dataset

| Dataset (# of classes) | Model (# of params.) | Baseline (random seed) | Baseline test acc. (single model) | Merging method | Final test acc. (merged model) |
|---|---|---|---|---|---|
| CIFAR-10 [18] (10) | ResNet 56 (0.86M) [12] | Model 1 (44) | 0.804 | Weight Average | 0.010 |
| | | Model 2 (45) | 0.747 | MeGA (Ours) | 0.822 |
| | | Model 3 (46) | 0.788 | | |
| | | Model 4 (47) | 0.774 | | |
| | | Model 5 (48) | 0.806 | | |
| | | Model 6 (49) | 0.786 | | |
| | | Model 7 (50) | 0.749 | | |
| | | Model 8 (51) | 0.787 | | |
| | ResNet 110 (1.78M) [12] | Model 1 (44) | 0.782 | Weight Average | 0.010 |
| | | Model 2 (45) | 0.814 | MeGA (Ours) | 0.824 |
| | | Model 3 (46) | 0.744 | | |
| | | Model 4 (47) | 0.800 | | |
| | | Model 5 (48) | 0.801 | | |
| | | Model 6 (49) | 0.787 | | |
| | | Model 7 (50) | 0.781 | | |
| | | Model 8 (51) | 0.758 | | |
| | ResNet 152 (3.51M) [12] | Model 1 (44) | 0.775 | Weight Average | 0.010 |
| | | Model 2 (45) | 0.783 | MeGA (Ours) | 0.815 |
| | | Model 3 (46) | 0.781 | | |
| | | Model 4 (47) | 0.747 | | |
| | | Model 5 (48) | 0.771 | | |
| | | Model 6 (49) | 0.767 | | |
| | | Model 7 (50) | 0.772 | | |
| | | Model 8 (51) | 0.797 | | |

This process ensures the final model combines the strengths of all eight original models. The results in Table 2 demonstrate the effectiveness of our genetic algorithm-based weight merging method. For ResNet 56 models, baseline test accuracies ranged from 0.747 to 0.806, improving to 0.822 after merging. ResNet 110 models had baseline accuracies between 0.744 and 0.814, with the merged model achieving 0.824. For ResNet 152 models, baseline accuracies ranged from 0.747 to 0.797, and the merged model reached 0.815. Despite the greater complexity of ResNet 152, our method effectively enhanced model performance.

These results highlight the potential of our approach to enhance neural network performance. The consistent improvement in test accuracy across different architectures (ResNet 56, ResNet 110, and ResNet 152) demonstrates the versatility and scalability of our method. The hierarchical merging process ensures the final model benefits from the strengths of individual models.

5 Discussion

Genetic algorithm-based weight merging is particularly beneficial for leveraging the strengths of various independently trained models, leading to enhanced overall performance. Its main advantages are as follows:

• Capturing Complementary Features: Merging multiple neural networks captures complementary features learned by individual models. Each model specializes in recognizing different patterns within the data, and combining their weights results in improved accuracy and robustness.

• Support for Distributed Environments: This method supports efficient and scalable training in distributed environments. By enabling the merging of models trained across multiple GPUs or devices, the approach facilitates distributed training, which is valuable in cloud-based systems or edge computing environments.

• Reduced Inference Resource Usage: Using a single, high-performance merged model reduces inference resource usage compared to using multiple models. This is particularly beneficial in environments where computational resources are limited or where efficiency is critical, such as mobile or embedded systems.

The genetic algorithm-based weight merging technique significantly enhances the performance of neural networks by effectively combining multiple pre-trained models. This method improves accuracy and generalization while offering practical benefits in terms of training efficiency, scalability, and resource usage. The hierarchical merging approach underscores the potential of genetic algorithms in optimizing neural network weights, providing a robust and efficient tool for modern deep learning applications.

6 Conclusion

In this paper, we introduced a genetic algorithm-based method for merging the weights of multiple pre-trained neural networks. This approach improved model performance by effectively combining the strengths of individual models. Our experiments on the CIFAR-10 dataset showed that both pairwise merging and the hierarchical merging of eight ResNet models yield a final model with higher test accuracy than any of the individual models and than simple weight averaging.

The genetic algorithm’s ability to optimize weight combinations through selection, crossover, and mutation allows for the creation of a more effective merged model without the need for additional training or architectural changes. This method also supports scalable and efficient training in distributed environments, making it a practical solution for various deep learning applications.

Overall, the genetic algorithm-based weight merging technique offers a powerful tool for enhancing neural network performance, providing a robust and efficient framework for integrating multiple pre-trained models. This work highlights the potential of genetic algorithms in neural network optimization, paving the way for further research and applications in artificial intelligence and machine learning.

References

[1] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1251–1258 (2017)
[2] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al.: Large scale distributed deep networks. Advances in neural information processing systems 25, 1223–1231 (2012)
[3] Dietterich, T.G.: Ensemble methods in machine learning. In: Multiple Classifier Systems, pp. 1–15. Springer (2000)
[4] Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A.A., Pritzel, A., Wierstra, D.: Pathnet: Evolution channels gradient descent in super neural networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 442–450 (2017)
[5] Fort, S., Hu, H., Lakshminarayanan, B.: Stiffness: A new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491 (2019)
[6] Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2019)
[7] Goldberg, D.E.: Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Longman Publishing Co., Inc. (1989)
[8] Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
[9] Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
[10] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch sgd: Training imagenet in 1 hour. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1567–1576 (2017)
[11] Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 993–1001 (1990)
[12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
[13] Holland, J.H.: Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press (1992)
[14] Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get m for free. In: International Conference on Learning Representations (ICLR) (2017)
[15] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4700–4708 (2017)
[16] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D.P., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407 (2018)
[17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[18] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto (2009)
[19] Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active learning. In: Tesauro, G., Touretzky, D., Leen, T. (eds.) Advances in Neural Information Processing Systems. vol. 7. MIT Press (1994)
[20] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems 30 (2017)
[21] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
[22] Li, M., Andersen, D.G., Park, J.W., Smola, A.J., Ahmed, A., Josifovski, V., Long, J., Shekita, E.J., Su, B.Y.: Scaling distributed machine learning with the parameter server. 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) pp. 583–598 (2014)
[23] Liu, J., Wang, W., **, R., Shen, Y.: A survey of deep neural network architectures and their applications. Neurocomputing 234, 11–26 (2020)
[24] Loshchilov, I., Hutter, F.: Cma-es for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269 (2016)
[25] Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press (1998)
[26] Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y., Tan, J., Le, Q.V., Kurakin, A.: Large-scale evolution of image classifiers. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 2902–2911 (2017)
[27] Schmidhuber, J.: Deep learning in neural networks: An overview. Neural networks 61, 85–117 (2015)
[28] Shen, Z.Q., Kong, F.S.: Optimizing weights by genetic algorithm for neural network ensemble. In: Yin, F., Wang, J., Guo, C. (eds.) Advances in Neural Networks–ISNN 2004. Lecture Notes in Computer Science, vol. 3173, pp. 323–331. Springer, Berlin, Heidelberg (2004)
[29] Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evolutionary computation 10(2), 99–127 (2002)
[30] Wang, H., Srinivasa, S., Ozturk, I., Kaftan, I., Macciocca, J., Salakhutdinov, R., Lim, S.N.: Towards understanding learning representations: To what extent do different neural networks learn the same representation. Advances in Neural Information Processing Systems 33, 9607–9621 (2020)
[31] Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient langevin dynamics. In: Proceedings of the 28th international conference on machine learning (ICML-11). pp. 681–688 (2011)
[32] Xie, L., Yuille, A.L.: Genetic cnn. arXiv preprint arXiv:1703.01513 (2017)