MeGA: Merging Multiple Independently Trained Neural Networks Based on Genetic Algorithm

Daniel Yun
Stony Brook University, Stony Brook NY 11794, USA
[email protected]

Abstract

In this paper, we introduce MeGA, a novel method for merging the weights of multiple pre-trained neural networks using a genetic algorithm. Traditional techniques, such as weight averaging and ensemble methods, often fail to fully harness the capabilities of pre-trained networks. Our approach leverages a genetic algorithm with tournament selection, crossover, and mutation to optimize weight combinations, creating a more effective fusion. This technique allows the merged model to inherit advantageous features from both parent models, resulting in enhanced accuracy and robustness. Through experiments on the CIFAR-10 dataset, we demonstrate that our genetic algorithm-based weight merging method improves test accuracy compared to individual models and conventional methods. This approach provides a scalable solution for integrating multiple pre-trained networks across various deep learning applications. The code is available on GitHub.

Keywords: Neural Networks · Deep Learning · Genetic Algorithm · Evolutionary Algorithms

1 Introduction

In recent years, deep learning has achieved state-of-the-art performance across various tasks, such as image classification and natural language processing, largely due to the use of pre-trained neural networks [8, 21, 27]. These networks can be fine-tuned for specific tasks, saving computational resources and time. However, effectively combining multiple pre-trained models of the same architecture to harness their collective strengths and mitigate individual weaknesses remains a significant challenge.

Merging weights from different pre-trained models with identical architectures is crucial because individual models often learn complementary features from the data [11, 3, 23]. Combining these models can create a more robust and accurate system, leveraging the strengths of each model and leading to improved performance. Additionally, model fusion can help achieve better generalization by averaging out individual biases and reducing overfitting [3, 20, 31].

Training models in parallel rather than sequentially offers efficiency and practicality. Multiple models can learn diverse patterns quickly, and merging their weights can achieve superior performance faster [14, 2, 10]. This approach is particularly advantageous in distributed learning environments, facilitating scalable and efficient training across multiple GPUs or devices [2, 22].

The primary challenge in weight merging is finding the optimal combination that maximizes performance without further training or altering the model’s architecture. Simple averaging fails to account for the intricate dependencies within the neural network [16, 5, 30]. To address this, we propose a novel approach using a genetic algorithm to optimize weight merging. Genetic algorithms efficiently search large, complex spaces by iteratively selecting, combining, and mutating weights [25, 13, 7].

In this paper, we introduce our genetic algorithm-based method, MeGA, for merging the weights of multiple trained CNN models. Experiments on the CIFAR-10 dataset [18] demonstrate improvements in test accuracy compared to individual models and traditional techniques. Rather than focusing on comparative analysis, this research emphasizes the methodology of effectively merging models to leverage their collective strengths. Our results highlight the potential of genetic algorithms in optimizing neural network weight fusion, providing a scalable solution for integrating multiple pre-trained models in various deep learning applications. Additionally, we demonstrate that it is possible to merge the weights of neural networks that have been initialized differently and trained independently, further underscoring the flexibility and robustness of our proposed method.

2 Related Works

The fusion of neural network models has been an active area of research due to its potential to enhance model performance by leveraging the strengths of individual models. Several methods have been proposed to address the challenges associated with model merging, each with unique approaches and varying degrees of success.

Weight Merge. One of the earliest and simplest methods for model fusion is weight averaging, where the weights of two or more models are averaged to create a new model. This approach, while straightforward, often fails to consider the complex interactions between different layers of the networks. Goodfellow et al. introduced an approach that averages the weights of neural networks trained with different initializations, showing marginal improvements in performance [9]. Similarly, ensemble methods, where the predictions of multiple models are combined, have been extensively studied and are known to improve model robustness and accuracy [19]. However, these methods do not directly merge the models’ internal representations, potentially limiting their effectiveness. Izmailov et al. proposed Stochastic Weight Averaging (SWA), which improves generalization by averaging weights along the trajectory of SGD with a cyclical or constant learning rate [16]. Welling and Teh applied Stochastic Gradient Langevin Dynamics (SGLD) to sample from the posterior distribution of model weights, thereby integrating Bayesian principles into model merging [31]. This method helps in capturing the model uncertainty but is computationally intensive and complex to implement. Huang et al. introduced the concept of Snapshot Ensembles, where a single neural network is trained with a cyclical learning rate schedule, and multiple snapshots of the model are taken at different local minima [14]. Shen and Kong [28] demonstrated the use of genetic algorithms to enhance prediction accuracy by optimizing weights in neural network ensembles.

Genetic Algorithms in Neural Networks. Genetic algorithms have been applied in various aspects of neural network optimization. Real et al. utilized evolutionary algorithms for neural architecture search, demonstrating the potential of genetic approaches in discovering optimal network structures [26]. Similarly, Xie and Yuille used genetic algorithms for model pruning, showing that evolutionary techniques can effectively optimize neural network parameters [32]. Stanley and Miikkulainen introduced NeuroEvolution of Augmenting Topologies (NEAT), which evolves both the architecture and weights of neural networks, leading to improved performance on complex tasks [29]. Another notable application is by Fernando et al., who proposed PathNet, a method that uses evolutionary strategies to discover pathways through a neural network, enabling efficient transfer learning [4]. Finally, Loshchilov and Hutter applied a genetic algorithm for hyperparameter optimization, showing that evolutionary strategies can outperform traditional grid and random search methods [24].

Building on the success of these methods, our approach leverages genetic algorithms to optimize the weight merging process of pre-trained CNN models. By iteratively selecting, combining, and mutating weights, our method systematically explores the weight space to discover an optimal or near-optimal set of weights. This novel approach not only enhances model performance but also facilitates efficient parallel training and distributed learning across multiple devices, addressing scalability and robustness issues inherent in traditional methods.

3 Methodology

In this section, we describe the methodology used to merge the weights of two pre-trained neural network models using a genetic algorithm. We call this methodology MeGA. Genetic algorithms provide a robust optimization framework by mimicking natural selection, making them suitable for complex tasks like weight merging [25].

Our approach aligns with the lottery ticket hypothesis, which suggests that neural networks contain critical weights (winning tickets) necessary for maintaining and enhancing performance [6]. In our MeGA algorithm, child models inherit and preserve these beneficial weight configurations from their parent models. By iteratively selecting the best-performing individuals as parents, the algorithm ensures that critical weights are retained and combined, allowing child models to evolve and enhance overall performance. This method enables us to effectively navigate the weight space, merging multiple pre-trained models into a single, superior model that leverages the strengths of each original network.

3.1 Genetic Algorithm Framework

The process of merging two neural networks using a genetic algorithm is illustrated in Figure 1. The initial population is created by combining the element-wise weights from two pre-trained networks. Each individual in the population represents a potential solution with a unique combination of weights. The genetic algorithm iteratively selects the best-performing individuals as parents based on their evaluation, performs crossover to generate new children, and applies mutation to introduce variability. This process continues over several generations, resulting in a single merged network that combines the strengths of both original networks.

Figure 1: MeGA: The process of merging two neural networks using a genetic algorithm.

Genetic algorithms operate on a population of potential solutions, iteratively improving them through the processes of selection, crossover, and mutation [25]. Each individual in the population represents a candidate solution, and its fitness is evaluated based on a predefined objective function. The algorithm proceeds by selecting individuals with higher fitness to produce offspring, combining their traits through crossover, and introducing random variations through mutation. Over successive generations, the population evolves towards better solutions [25].

Algorithm 1 Genetic Algorithm for Weight Merging (Element-wise)
Require: $N$: Population size
Require: $G$: Number of generations
Require: $K$: Number of parents for tournament selection
Require: $p_{\text{mut}}$: Mutation probability
Require: $\sigma$: Standard deviation for Gaussian noise in mutation
Require: $\boldsymbol{\theta}_1$: Weights of the first pre-trained model
Require: $\boldsymbol{\theta}_2$: Weights of the second pre-trained model
Ensure: Optimal weights $\boldsymbol{\theta}^{*}$
Initialization:
Initialize population $\mathcal{P}$ with $N$ individuals where each individual $\boldsymbol{\theta}_i$ is a linear combination of $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$
for each individual $i$ in $\mathcal{P}$ do
    $\alpha \sim \text{Uniform}(0,1)$
    $\theta_{i,j,k} = \alpha \theta_{1,j,k} + (1-\alpha) \theta_{2,j,k} \quad \forall j,k$
end for
Genetic Algorithm Execution:
for generation = 1 to $G$ do
    Fitness Evaluation: Evaluate fitness $F(\boldsymbol{\theta}_i)$ for each individual in $\mathcal{P}$
    for each individual $i$ in $\mathcal{P}$ do
        $F(\boldsymbol{\theta}_i) = \text{Accuracy}(f(\mathbf{X}_{\text{val}}; \boldsymbol{\theta}_i), \mathbf{y}_{\text{val}})$
    end for
    Selection: Select $K$ parents using tournament selection
    for each selection round from 1 to $K$ do
        Randomly select $t$ individuals to form tournament set $\mathcal{T}$
        $\mathcal{T} = \{\boldsymbol{\theta}_{i_1}, \boldsymbol{\theta}_{i_2}, \ldots, \boldsymbol{\theta}_{i_t}\}$
        Evaluate the fitness $F(\boldsymbol{\theta})$ of each individual in the tournament set
        Select the individual with the highest fitness as a parent: $\boldsymbol{\theta}_{\text{parent}} = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}} F(\boldsymbol{\theta})$
        Add $\boldsymbol{\theta}_{\text{parent}}$ to the selected parents set
    end for
    Crossover: Generate offspring through element-wise crossover of selected parents
    for each pair of parents $(\boldsymbol{\theta}_{\text{parent1}}, \boldsymbol{\theta}_{\text{parent2}})$ in selected parents do
        $\alpha \sim \text{Uniform}(0,1)$
        $\theta_{\text{child},j,k} = \alpha \theta_{\text{parent1},j,k} + (1-\alpha) \theta_{\text{parent2},j,k} \quad \forall j,k$
        Add $\boldsymbol{\theta}_{\text{child}}$ to the offspring set
    end for
    Mutation: Apply mutation to offspring
    for each individual $\boldsymbol{\theta}_{\text{child}}$ in offspring do
        for each weight element $\theta_{j,k}$ in $\boldsymbol{\theta}_{\text{child}}$ do
            if random number $< p_{\text{mut}}$ then
                $\theta_{j,k} = \theta_{j,k} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^{2})$
            end if
        end for
    end for
    Population Update: Form new population $\mathcal{P}'$ with the best individual from $\mathcal{P}$ (elitism) and newly generated offspring
    Keep the best individual from $\mathcal{P}$ based on fitness
    $\mathcal{P} \leftarrow \mathcal{P}'$
    Update the best individual $\boldsymbol{\theta}^{*}$ if a better fitness is found in the current generation
end for
Output: Best individual $\boldsymbol{\theta}^{*}$ with the highest fitness

3.1.1 Preliminaries.

Let $\mathbf{X} \in \mathbb{R}^{n \times d}$ be the training data, where $n$ is the number of samples and $d$ is the dimensionality of each sample. The corresponding labels are denoted by $\mathbf{y} \in \mathbb{R}^{n}$. We define two pre-trained neural networks, $f_1(\mathbf{X}; \boldsymbol{\theta}_1)$ and $f_2(\mathbf{X}; \boldsymbol{\theta}_2)$, where $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ represent the weights of the respective models. Our objective is to merge these weights to create a new model $f(\mathbf{X}; \boldsymbol{\theta})$ that achieves superior performance.

3.1.2 Algorithm Execution.

The genetic algorithm iterates through $G$ generations, performing selection, crossover, and mutation at each step. The best individual $\boldsymbol{\theta}^{*}$ with the highest fitness across all generations is selected as the final set of weights for the fused model. The overall procedure is summarized as follows:

  1. Initialization: Create an initial population $\mathcal{P}$ of $N$ individuals.

  2. Fitness Evaluation: Evaluate the fitness $F(\boldsymbol{\theta})$ for each individual in $\mathcal{P}$.

  3. Selection: Select $K$ parents from $\mathcal{P}$ using tournament selection.

  4. Crossover: Generate offspring through crossover of selected parents.

  5. Mutation: Apply mutation to the offspring.

  6. Population Update: Form the new population $\mathcal{P}'$ with the best individual from $\mathcal{P}$ (elitism) and the newly generated offspring.

  7. Iteration: Repeat steps 2-6 for $G$ generations.
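
The seven steps above can be sketched as a generic loop in which each operator is passed in as a function. This is a minimal illustration rather than the released implementation: the names `run_ga`, `init_population`, `select_parents`, and `make_child` are hypothetical, weights are represented as flat Python lists, and producing $N-1$ offspring per generation (so that the elite individual keeps the population size constant) is an assumption, since the text does not state how many offspring are generated.

```python
from typing import Callable, List, Tuple

Weights = List[float]  # flattened weight vector; a stand-in for real model tensors


def run_ga(
    init_population: Callable[[], List[Weights]],
    fitness: Callable[[Weights], float],
    select_parents: Callable[[List[Weights], List[float]], List[Weights]],
    make_child: Callable[[List[Weights]], Weights],
    generations: int,
) -> Tuple[Weights, float]:
    """Generational loop with elitism; returns the best evaluated individual."""
    population = init_population()                        # step 1
    best, best_fit = None, float("-inf")
    for _ in range(generations):                          # step 7
        scores = [fitness(ind) for ind in population]     # step 2
        elite_idx = max(range(len(population)), key=scores.__getitem__)
        if scores[elite_idx] > best_fit:                  # track theta* across generations
            best, best_fit = list(population[elite_idx]), scores[elite_idx]
        parents = select_parents(population, scores)      # step 3
        # steps 4-5: make_child is assumed to apply crossover followed by mutation
        offspring = [make_child(parents) for _ in range(len(population) - 1)]
        # step 6: elitism, the best individual survives unchanged
        population = [list(population[elite_idx])] + offspring
    return best, best_fit
```

Passing the operators as arguments keeps the control flow separate from the specific choices of selection, crossover, and mutation described in the following subsections.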

3.1.3 Initialization

The initial population $\mathcal{P}$ consists of $N$ individuals, where each individual represents a potential solution in the form of a set of weights $\boldsymbol{\theta}$. Each individual is initialized to ensure diversity, which is crucial for the effectiveness of the genetic algorithm.

The weights of each individual $\boldsymbol{\theta}_i$ are initialized as an element-wise linear combination of the weights from two pre-trained models, $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$:

$$\theta_{i,j,k} = \alpha \theta_{1,j,k} + (1-\alpha) \theta_{2,j,k}, \quad \alpha \sim \text{Uniform}(0,1)$$

Here, $\alpha$ is a random scalar drawn from a uniform distribution between 0 and 1. This ensures that each individual's weights are a unique blend of the two parent models, with the combination applied element-wise to preserve the detailed characteristics of both models.
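
As a minimal sketch of this initialization, the snippet below draws one scalar $\alpha$ per individual and blends two weight vectors element-wise. The function name `init_population`, the flat-list representation of weights, and the toy values are illustrative assumptions, not part of the original method description.

```python
import random
from typing import List


def init_population(theta1: List[float], theta2: List[float],
                    n_pop: int, rng: random.Random) -> List[List[float]]:
    """Create n_pop blends alpha*theta1 + (1-alpha)*theta2, one scalar alpha each."""
    if len(theta1) != len(theta2):
        raise ValueError("both models must have the same number of weights")
    population = []
    for _ in range(n_pop):
        alpha = rng.random()  # alpha drawn uniformly from [0, 1), shared by all weights
        population.append([alpha * a + (1.0 - alpha) * b
                           for a, b in zip(theta1, theta2)])
    return population


# Toy usage: two 3-weight "models" and a population of 5 individuals.
pop = init_population([1.0, -2.0, 0.5], [3.0, 2.0, 0.5],
                      n_pop=5, rng=random.Random(0))
```

Because a single $\alpha$ is shared by all weights of an individual, every initial individual lies on the line segment between $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ in weight space; diversity within the initial population comes only from the different values of $\alpha$.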

3.1.4 Fitness Evaluation

The fitness of each individual in the population is assessed based on the validation accuracy on a hold-out validation set $(\mathbf{X}_{\text{val}}, \mathbf{y}_{\text{val}})$. For an individual with weights $\boldsymbol{\theta}_i$, the fitness function $F(\boldsymbol{\theta}_i)$ is defined as follows:

$$F(\boldsymbol{\theta}_i) = \text{Accuracy}(f(\mathbf{X}_{\text{val}}; \boldsymbol{\theta}_i), \mathbf{y}_{\text{val}})$$

This function measures how accurately the model with weights $\boldsymbol{\theta}_i$ predicts the labels of the validation set. The model $f(\mathbf{X}; \boldsymbol{\theta}_i)$ is evaluated, and the accuracy is computed as the proportion of correctly classified samples in $\mathbf{X}_{\text{val}}$.

Mathematically, the accuracy is given by:

$$\text{Accuracy}(f(\mathbf{X}; \boldsymbol{\theta}), \mathbf{y}) = \frac{1}{n_{\text{val}}} \sum_{i=1}^{n_{\text{val}}} \mathbb{I}\left(\arg\max_{j} f_j(\mathbf{X}_i; \boldsymbol{\theta}) = y_i\right)$$

where $n_{\text{val}}$ is the number of validation samples, $f_j(\mathbf{X}_i; \boldsymbol{\theta})$ is the predicted probability for class $j$ for the $i$-th validation sample, and $\mathbb{I}$ is the indicator function that equals 1 if the predicted class matches the true label $y_i$, and 0 otherwise.
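
The accuracy formula can be written directly as a short function. In this sketch, `probs` stands in for the model outputs $f_j(\mathbf{X}_i; \boldsymbol{\theta})$ (one row of class scores per validation sample) and `labels` for $\mathbf{y}_{\text{val}}$; both names and the toy values are illustrative assumptions.

```python
from typing import Sequence


def accuracy(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose arg-max class equals the true label."""
    if len(probs) != len(labels) or len(labels) == 0:
        raise ValueError("probs and labels must be non-empty and of equal length")
    correct = 0
    for row, y in zip(probs, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # arg max over classes j
        correct += int(predicted == y)                         # indicator function
    return correct / len(labels)


# Toy usage: 4 samples, 2 classes; the third prediction is wrong.
toy_probs = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
toy_acc = accuracy(toy_probs, toy_labels)  # 3 of 4 correct
```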

3.1.5 Selection

Tournament selection is employed to choose $K$ parents for the next generation. This method ensures that individuals with higher fitness are more likely to be selected, promoting desirable traits in subsequent generations. The selection process involves several steps.

First, we randomly select $t$ individuals from the current population $\mathcal{P}$ to form a tournament set $\mathcal{T}$:

$$\mathcal{T} = \{\boldsymbol{\theta}_{i_1}, \boldsymbol{\theta}_{i_2}, \ldots, \boldsymbol{\theta}_{i_t}\}, \quad \mathcal{T} \subseteq \mathcal{P}$$

Next, the fitness $F(\boldsymbol{\theta})$ of each individual $\boldsymbol{\theta}$ in the tournament set $\mathcal{T}$ is evaluated using the previously defined fitness function. After evaluating the fitness of all individuals in the tournament set, the individual with the highest fitness is selected as a parent:

$$\boldsymbol{\theta}_{\text{parent}} = \arg\max_{\boldsymbol{\theta} \in \mathcal{T}} F(\boldsymbol{\theta})$$

This process is repeated until $K$ parents are selected for crossover.
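
A minimal sketch of this selection step follows. The function name `tournament_select` is hypothetical, and two details are assumptions because the text does not specify them: contenders within one tournament are drawn without replacement, and the same individual may win several tournaments.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def tournament_select(population: Sequence[T], scores: Sequence[float],
                      k: int, t: int, rng: random.Random) -> List[T]:
    """Run k tournaments of size t and return the k winners (the parents)."""
    parents = []
    for _ in range(k):
        # Draw t distinct contenders (assumption: sampling without replacement).
        contenders = rng.sample(range(len(population)), t)
        # The contender with the highest precomputed fitness wins.
        winner = max(contenders, key=lambda i: scores[i])
        parents.append(population[winner])
    return parents


# Toy usage: individuals are labels here; scores play the role of F(theta).
toy_pop = ["a", "b", "c", "d"]
toy_scores = [0.1, 0.9, 0.5, 0.3]
toy_parents = tournament_select(toy_pop, toy_scores, k=3, t=2, rng=random.Random(0))
```

The tournament size $t$ controls selection pressure: with $t$ equal to the population size the best individual always wins, while smaller values give weaker individuals a chance to become parents.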

3.1.6 Crossover

Crossover is performed on pairs of selected parents to generate offspring. This process combines the genetic material (weights) of two parents to produce a new individual (offspring), thereby promoting genetic diversity. Given two parents $\boldsymbol{\theta}_a$ and $\boldsymbol{\theta}_b$, an offspring $\boldsymbol{\theta}_{\text{child}}$ is created by taking a weighted combination of the parents' weights. The crossover is performed element-wise for each weight:

$$\theta_{\text{child},j,k} = \beta \theta_{a,j,k} + (1-\beta) \theta_{b,j,k}, \quad \beta \sim \text{Uniform}(0,1)$$
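
As a small worked example of the crossover rule, suppose (purely for illustration, these values do not come from the experiments) that the two parents each have two weights and that the draw happens to be $\beta = 0.25$:

```latex
\boldsymbol{\theta}_a = (1.0,\ -2.0), \qquad
\boldsymbol{\theta}_b = (3.0,\ 2.0), \qquad
\beta = 0.25

\boldsymbol{\theta}_{\text{child}}
  = 0.25\,(1.0,\ -2.0) + 0.75\,(3.0,\ 2.0)
  = (0.25 + 2.25,\ -0.5 + 1.5)
  = (2.5,\ 1.0)
```

Each child weight lies between the corresponding parent weights, and because the same $\beta$ is applied to every element, the child lies on the line segment joining the two parents in weight space.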

3.1.7 Mutation

Mutation introduces variability into the population by perturbing the weights of the offspring. For each weight $\theta_{j,k}$ in the offspring $\boldsymbol{\theta}_{\text{child}}$, mutation is applied with a probability $p_{\text{mut}}$:

$$\theta_{\text{child},j,k} \leftarrow \theta_{\text{child},j,k} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^{2})$$

Here, $\epsilon$ is a random variable drawn from a normal distribution with mean 0 and variance $\sigma^{2}$. The mutation rate $p_{\text{mut}}$ controls the likelihood of each weight being mutated. This mutation process ensures that the population maintains genetic diversity by introducing new genetic material.
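
The mutation rule can be sketched as follows. The function name `mutate`, the flat-list weight representation, and the value of `sigma` in the usage lines are illustrative assumptions (the text does not report the value of $\sigma$ used in the experiments).

```python
import random
from typing import List


def mutate(theta: List[float], p_mut: float, sigma: float,
           rng: random.Random) -> List[float]:
    """Return a copy of theta where each weight is perturbed with probability p_mut."""
    mutated = []
    for w in theta:
        if rng.random() < p_mut:
            w = w + rng.gauss(0.0, sigma)  # epsilon ~ N(0, sigma^2)
        mutated.append(w)
    return mutated


# Toy usage: 10,000 zero weights, mutation rate 0.02, sigma chosen arbitrarily.
toy_child = [0.0] * 10_000
toy_mutated = mutate(toy_child, p_mut=0.02, sigma=0.01, rng=random.Random(0))
n_changed = sum(1 for w in toy_mutated if w != 0.0)  # about 2% of the weights
```

With a mutation rate of 0.02, roughly one weight in fifty is perturbed per offspring, so most of the inherited weight configuration is left intact.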

4 Experimental Results

4.1 Experimental Settings

We used the CIFAR-10 dataset [18]. The CIFAR-10 training set was split into 45,000 training samples and 5,000 validation samples. The validation set was used for model validation and during the merging process. We used three families of CNN models, ResNet [12], Xception [1], and DenseNet [15], without pre-trained weights and without data augmentation. These models were trained with a batch size of 256 for 50 epochs using the Adam optimizer [17] with a learning rate of 0.01.

Each model followed the same architecture and hyperparameters to ensure consistency and a fair comparison. After training, we applied our MeGA approach to merge the weights of these models into a single set of weights. The performance of the merged model was then evaluated and compared against the individual models. For all experiments, we used the following MeGA hyperparameters: a population size of 20, 20 generations, 4 parents per generation, and a mutation rate of 0.02.

4.2 CIFAR-10 Results

The results of the image classification experiments on the CIFAR-10 dataset [18] demonstrate the effectiveness of MeGA.

Table 1: Performance Comparison of Individual and Merged CNN Models based on MeGA with CIFAR-10 Dataset

Dataset: CIFAR-10 [18] (10 classes). Baseline columns give the test accuracy of each single model (random seed in parentheses); the last two columns give the final test accuracy of the merged model for each merging method.

| Model (# of Params.) | Baseline Model 1 (56) | Baseline Model 2 (57) | Weight Average | MeGA (Ours) |
|---|---|---|---|---|
| ResNet 56 (0.86M) [12] | 0.809 | 0.793 | 0.010 | 0.822 |
| ResNet 110 (1.78M) [12] | 0.801 | 0.814 | 0.010 | 0.816 |
| ResNet 152 (3.51M) [12] | 0.783 | 0.781 | 0.010 | 0.819 |
| Xception (22.96M) [1] | 0.716 | 0.730 | 0.010 | 0.754 |
| DenseNet 121 (7.04M) [15] | 0.719 | 0.728 | 0.010 | 0.742 |
| DenseNet 169 (12.65M) [15] | 0.735 | 0.731 | 0.010 | 0.753 |

The MeGA approach improved test accuracy over the individual models for every architecture, as shown in Table 1. For example, relative to the mean accuracy of its two baseline models, ResNet 56 improved from 0.801 to 0.822, and the merged ResNet 110 model reached 0.816 (0.824 in the eight-model setting reported in Table 2).

Figure 2: MeGA Fitness Progression (Test Accuracy) for ResNet-56 and DenseNet 121 models over 20 generations.

The Xception model's accuracy increased from a two-model baseline mean of 0.723 to 0.754, demonstrating the effectiveness of MeGA in optimizing complex architectures. Similarly, the DenseNet 121 model's accuracy rose from a baseline mean of 0.724 to 0.742, and the DenseNet 169 model improved from 0.733 to 0.753. Overall, the MeGA approach consistently outperformed the individual models, highlighting its potential for enhancing neural network performance through genetic algorithm-based weight merging.

Figure 2 shows the progression of the best fitness values over 20 generations for the ResNet-56 and DenseNet 121 models. For the ResNet-56 model, the best fitness improved steadily from 0.8042 in the first generation to 0.8224 in the twentieth generation. Similarly, the DenseNet 121 model saw an improvement in best fitness from 0.7290 to 0.7427 over the same period. These plots illustrate the genetic algorithm’s capability to effectively navigate the weight space and optimize combinations to enhance model performance.

Weight averaging performed poorly (0.010 for every architecture) because the networks were initialized differently and trained independently. Their learned weights therefore do not correspond to one another position by position, so a simple element-wise average does not yield a functioning network. This highlights the importance of considering the initialization and training processes when merging models.
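
One common explanation for this kind of failure, offered here as an illustration rather than as an analysis performed in the paper, is that independently trained networks can encode the same function with their hidden units in a different order. The toy example below uses a hypothetical two-unit ReLU network: models A and B compute identical outputs, yet their element-wise average computes something else entirely.

```python
def forward(w1, w2, x):
    """Tiny network: scalar input -> ReLU hidden layer -> scalar output."""
    hidden = [max(0.0, w * x) for w in w1]
    return sum(h * v for h, v in zip(hidden, w2))


def average(a, b):
    """Element-wise average of two weight vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]


# Model A: hidden unit 0 responds to positive x, unit 1 to negative x.
w1_a, w2_a = [1.0, -1.0], [2.0, 3.0]
# Model B: the same function with the two hidden units swapped.
w1_b, w2_b = [-1.0, 1.0], [3.0, 2.0]

w1_avg, w2_avg = average(w1_a, w1_b), average(w2_a, w2_b)

for x in (-1.0, 1.0):
    print(f"x={x:+.1f}  A={forward(w1_a, w2_a, x):.1f}  "
          f"B={forward(w1_b, w2_b, x):.1f}  "
          f"average={forward(w1_avg, w2_avg, x):.1f}")
```

In this example the averaged first-layer weights cancel to zero, so the averaged network outputs zero for every input even though both parents agree with each other.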

4.3 Extended Experiments for Multi-Models

In this section, we illustrate the application of our genetic algorithm-based weight merging method to combine multiple neural network models as depicted in Figure 3. The hierarchical merging process starts by training eight models on the CIFAR-10 dataset [12, 18]. These models are paired and merged using a genetic algorithm, resulting in four intermediate models. These intermediate models are further merged into two higher-level models, which are finally merged into a single, robust model.
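
The hierarchical procedure described above can be sketched as a simple pairwise reduction. In this sketch `merge_pair` is a placeholder for the genetic algorithm-based merge of two models; pairing adjacent models and carrying an unpaired model forward are assumptions, since the text does not specify how pairs are chosen.

```python
def hierarchical_merge(models, merge_pair):
    """Merge models pairwise, level by level, until one remains.

    Returns the final model and the number of models at each level.
    """
    if not models:
        raise ValueError("need at least one model")
    level = list(models)
    level_sizes = [len(level)]
    while len(level) > 1:
        merged = [merge_pair(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # odd count: carry the last model forward
            merged.append(level[-1])
        level = merged
        level_sizes.append(len(level))
    return level[0], level_sizes


def label_merge(a, b):
    """Placeholder merge that only records which models were combined."""
    return f"({a}+{b})"


final, sizes = hierarchical_merge([f"M{i}" for i in range(1, 9)], label_merge)
print(sizes)
print(final)
```

With eight input models the reduction passes through levels of eight, four, two, and one model, so seven pairwise merges are performed in total.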

Figure 3: Hierarchical merging process of eight models into a single final model.
Table 2: Performance Comparison of 8 Individual and Merged ResNet Models based on MeGA with CIFAR-10 Dataset

| Dataset (# of classes) | Model (# of Params) | Baseline (Random Seed) | Baseline Test Acc. (Single Model) | Merging Method | Final Test Acc. (Merged Model) |
|---|---|---|---|---|---|
| CIFAR-10 [18] (10) | ResNet 56 (0.86M) [12] | Model 1 (44) | 0.804 | Weight Average | 0.010 |
| | | Model 2 (45) | 0.747 | | |
| | | Model 3 (46) | 0.788 | | |
| | | Model 4 (47) | 0.774 | | |
| | | Model 5 (48) | 0.806 | MeGA (Ours) | 0.822 |
| | | Model 6 (49) | 0.786 | | |
| | | Model 7 (50) | 0.749 | | |
| | | Model 8 (51) | 0.787 | | |
| | ResNet 110 (1.78M) [12] | Model 1 (44) | 0.782 | Weight Average | 0.010 |
| | | Model 2 (45) | 0.814 | | |
| | | Model 3 (46) | 0.744 | | |
| | | Model 4 (47) | 0.800 | | |
| | | Model 5 (48) | 0.801 | MeGA (Ours) | 0.824 |
| | | Model 6 (49) | 0.787 | | |
| | | Model 7 (50) | 0.781 | | |
| | | Model 8 (51) | 0.758 | | |
| | ResNet 152 (3.51M) [12] | Model 1 (44) | 0.775 | Weight Average | 0.010 |
| | | Model 2 (45) | 0.783 | | |
| | | Model 3 (46) | 0.781 | | |
| | | Model 4 (47) | 0.747 | | |
| | | Model 5 (48) | 0.771 | MeGA (Ours) | 0.815 |
| | | Model 6 (49) | 0.767 | | |
| | | Model 7 (50) | 0.772 | | |
| | | Model 8 (51) | 0.797 | | |

This process lets the final model combine the strengths of all eight original models. The results in Table 2 demonstrate the effectiveness of our genetic algorithm-based weight merging method. For ResNet 56 models, baseline test accuracies ranged from 0.747 to 0.806, improving to 0.822 after merging. ResNet 110 models had baseline accuracies between 0.744 and 0.814, with the merged model achieving 0.824. For ResNet 152 models, baseline accuracies ranged from 0.747 to 0.797, and the merged model reached 0.815. Despite the greater depth of ResNet 152, our method still improved on every individual model.

These results highlight the potential of our approach to enhance neural network performance. The consistent improvement in test accuracy across different architectures (ResNet 56, ResNet 110, and ResNet 152) demonstrates the versatility and scalability of our method. The hierarchical merging process ensures the final model benefits from the strengths of individual models.

5 Discussion

Genetic algorithm-based weight merging offers several advantages when combining multiple neural network models. It is particularly useful for leveraging the strengths of independently trained models to improve overall performance. The main advantages are as follows:

  • Capturing Complementary Features: Merging multiple neural networks captures complementary features learned by individual models. Each model specializes in recognizing different patterns within the data, and combining their weights results in improved accuracy and robustness.

  • Support for Distributed Environments: This method supports efficient and scalable training in distributed environments. By enabling the merging of models trained across multiple GPUs or devices, the approach facilitates distributed training, which is valuable in cloud-based systems or edge computing environments.

  • Reduced Inference Resource Usage: Using a single, high-performance merged model reduces inference resource usage compared to using multiple models. This is particularly beneficial in environments where computational resources are limited or where efficiency is critical, such as mobile or embedded systems.

The genetic algorithm-based weight merging technique significantly enhances the performance of neural networks by effectively combining multiple pre-trained models. This method improves accuracy and generalization while offering practical benefits in terms of training efficiency, scalability, and resource usage. The hierarchical merging approach underscores the potential of genetic algorithms in optimizing neural network weights, providing a robust and efficient tool for modern deep learning applications.

6 Conclusion

In this paper, we introduced a genetic algorithm-based method for merging the weights of multiple pre-trained neural networks. This approach demonstrated significant improvements in model performance and robustness by effectively combining the strengths of individual models. Our experiments on the CIFAR-10 dataset confirmed that the hierarchical merging process of eight ResNet 56 models results in a final model with superior accuracy and generalization compared to traditional methods like weight averaging and ensemble techniques.

The genetic algorithm’s ability to optimize weight combinations through selection, crossover, and mutation allows for the creation of a more effective merged model without the need for additional training or architectural changes. This method also supports scalable and efficient training in distributed environments, making it a practical solution for various deep learning applications.

Overall, the genetic algorithm-based weight merging technique offers a powerful tool for enhancing neural network performance, providing a robust and efficient framework for integrating multiple pre-trained models. This work highlights the potential of genetic algorithms in neural network optimization, paving the way for further research and applications in artificial intelligence and machine learning.

References

  • [1] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1251–1258 (2017)
  • [2] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al.: Large scale distributed deep networks. Advances in neural information processing systems 25, 1223–1231 (2012)
  • [3] Dietterich, T.G.: Ensemble methods in machine learning. In: Multiple Classifier Systems, pp. 1–15. Springer (2000)
  • [4] Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A.A., Pritzel, A., Wierstra, D.: Pathnet: Evolution channels gradient descent in super neural networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 442–450 (2017)
  • [5] Fort, S., Hu, H., Lakshminarayanan, B.: Stiffness: A new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491 (2019)
  • [6] Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2019)
  • [7] Goldberg, D.E.: Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Longman Publishing Co., Inc. (1989)
  • [8] Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
  • [9] Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
  • [10] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch sgd: Training imagenet in 1 hour. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1567–1576 (2017)
  • [11] Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 993–1001 (1990)
  • [12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [13] Holland, J.H.: Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press (1992)
  • [14] Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get m for free. In: International Conference on Learning Representations (ICLR) (2017)
  • [15] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4700–4708 (2017)
  • [16] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D.P., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407 (2018)
  • [17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [18] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto (2009)
  • [19] Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active learning. In: Tesauro, G., Touretzky, D., Leen, T. (eds.) Advances in Neural Information Processing Systems. vol. 7. MIT Press (1994)
  • [20] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems 30 (2017)
  • [21] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
  • [22] Li, M., Andersen, D.G., Park, J.W., Smola, A.J., Ahmed, A., Josifovski, V., Long, J., Shekita, E.J., Su, B.Y.: Scaling distributed machine learning with the parameter server. 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) pp. 583–598 (2014)
  • [23] Liu, J., Wang, W., Jin, R., Shen, Y.: A survey of deep neural network architectures and their applications. Neurocomputing 234, 11–26 (2020)
  • [24] Loshchilov, I., Hutter, F.: CMA-ES for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269 (2016)
  • [25] Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press (1998)
  • [26] Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y., Tan, J., Le, Q.V., Kurakin, A.: Large-scale evolution of image classifiers. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 2902–2911 (2017)
  • [27] Schmidhuber, J.: Deep learning in neural networks: An overview. Neural networks 61, 85–117 (2015)
  • [28] Shen, Z.Q., Kong, F.S.: Optimizing weights by genetic algorithm for neural network ensemble. In: Yin, F., Wang, J., Guo, C. (eds.) Advances in Neural Networks–ISNN 2004. Lecture Notes in Computer Science, vol. 3173, pp. 323–331. Springer, Berlin, Heidelberg (2004)
  • [29] Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evolutionary computation 10(2), 99–127 (2002)
  • [30] Wang, H., Srinivasa, S., Ozturk, I., Kaftan, I., Macciocca, J., Salakhutdinov, R., Lim, S.N.: Towards understanding learning representations: To what extent do different neural networks learn the same representation. Advances in Neural Information Processing Systems 33, 9607–9621 (2020)
  • [31] Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient langevin dynamics. In: Proceedings of the 28th international conference on machine learning (ICML-11). pp. 681–688 (2011)
  • [32] Xie, L., Yuille, A.L.: Genetic cnn. arXiv preprint arXiv:1703.01513 (2017)