Dataless Quadratic Neural Networks for the Maximum Independent Set Problem
Abstract
Combinatorial Optimization (CO) plays a crucial role in addressing various significant problems, among them the challenging Maximum Independent Set (MIS) problem. In light of recent advancements in deep learning methods, efforts have been directed towards leveraging data-driven learning approaches, typically rooted in supervised learning and reinforcement learning, to tackle the NP-hard MIS problem. However, these approaches rely on labeled datasets, exhibit weak generalization, and often depend on problem-specific heuristics. Recently, ReLU-based dataless neural networks were introduced to address combinatorial optimization problems. This paper introduces a novel dataless quadratic neural network formulation, featuring a continuous quadratic relaxation for the MIS problem. Notably, our method eliminates the need for training data by treating the given MIS instance as a trainable entity. More specifically, the graph structure and constraints of the MIS instance are used to define the structure and parameters of the neural network such that training it on a fixed input provides a solution to the problem, thereby setting it apart from traditional supervised or reinforcement learning approaches. By employing a gradient-based optimization algorithm like ADAM and leveraging an efficient off-the-shelf GPU parallel implementation, our straightforward yet effective approach demonstrates competitive or superior performance compared to state-of-the-art learning-based methods. Another significant advantage of our approach is that, unlike exact and heuristic solvers, the running time of our method scales only with the number of nodes in the graph, not the number of edges.
1 Introduction
In his landmark paper [1], Richard Karp introduced the concept of reducibility among combinatorial problems that are complete for the complexity class NP. This pivotal work established a connection between combinatorial optimization problems and the NP-hard complexity class, implying their inherent computational challenges. Although these problems are notorious for their intractability, they have proven to be foundational in various sectors [2], demonstrating their widespread applicability. While no solver that runs in polynomial time with respect to (w.r.t.) the input size is known for NP-hard problems, various efficient solvers have been developed [3]. Such solvers can be broadly classified into heuristic algorithms [4], branch-and-bound-based global optimization methods [5], and approximation algorithms [6].
Among NP-hard problems, one of the most fundamental is the ‘Maximum Independent Set’ (MIS) problem, which asks for a subset of vertices of maximum cardinality such that no two vertices in the subset are connected by an edge [7]. In the past few decades, in addition to commercial Integer Programming (IP) solvers (e.g., CPLEX [8], Gurobi [9], and most recently CP-SAT [10]), powerful heuristic methods (e.g., ReduMIS in [3]) have been introduced to tackle the complexities inherent in the MIS problem. More recently, a plethora of data-driven machine learning approaches have been proposed for solving the MIS problem [11, 12, 13]. These methods fall into two categories: Supervised Learning (SL) approaches and Reinforcement Learning (RL) approaches.
However, both data-driven SL and RL approaches are known for their unsatisfactory generalization performance when faced with graph instances exhibiting structural characteristics different from those in the training dataset [12] (see Section 2.2 for further discussion). Additionally, the number of training parameters in learning-based methods is significantly larger than our approach. For instance, the network used in the most recent state-of-the-art (SOTA) method, DIFUSCO [14], consists of 12 layers, each with five trainable weight matrices of dimensions , resulting in nearly four million trainable parameters for the SATLIB dataset, which has graphs of at most nodes. By comparison, our approach would require trainable parameters. Moreover, many of these existing methods achieve SOTA results only when employing various MIS-specific subroutines, as thoroughly analyzed and elucidated in the recent work by [12]. The limitations of these data-dependent methods lead to an open question:
In this paper, we answer this question affirmatively by proposing a dataless quadratic neural network approach (dQNN). Collectively, our proposed dataless neural network offers a novel neural-network-based method for addressing combinatorial optimization problems without the need for any training data, hence resolving the out-of-distribution generalization challenges in existing learning-based methods. More concretely, given an MIS instance $G$, we encode $G$ into the parameters of a neural network such that those parameters yield the solution to the given MIS problem after training. Such neural architecture designs can further benefit from GPU implementations with massive parallelism when compared to classical non-learning methods. The main contributions of our work are summarized as follows:
- We first propose dataless quadratic neural networks (dQNN) that encode the input graph and its complement. The neural architecture of dQNN implicitly defines a continuous and differentiable relaxation of the MIS problem, thereby enabling an efficient optimization process and paving the way for enhanced performance in solving the MIS problem.
- To improve the exploration of dQNN for solving the MIS problem, we propose three initialization schemes: (i) sampling from the uniform distribution when the degrees of all nodes are similar, (ii) a sampling scheme based on a continuous semidefinite programming (SDP) relaxation of the MIS problem for sparse graphs, and (iii) a degree-based initialization scheme for dense graphs.
- We provide a theoretical analysis of sufficient and necessary conditions on the edges-penalty parameter of dQNN. Furthermore, we provide theoretical insights on the local minimizers. These derivations shed light on the underlying dynamics of the optimization process associated with our proposed dataless neural network and clarify how the parameter influences the behavior of the optimization algorithm.
- Our experiments on known challenging graph datasets, utilizing standard tools such as the Adam optimizer and GPU libraries, establish the efficacy of our proposed dQNN approach, which shows competitive or superior performance compared to SOTA data-driven learning approaches.
2 Preliminaries and related work
2.1 The MIS problem formulations
Notations:
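Consider an undirected graph represented as $G = (V, E)$, where $V$ is the vertex set and $E \subseteq V \times V$ is the edge set. The cardinality of a set is denoted by $|\cdot|$. The number of nodes (resp. edges) is denoted by $n = |V|$ (resp. $m = |E|$). Unless otherwise stated, for a node $v \in V$, we use $\mathcal{N}(v) = \{u \in V : (u, v) \in E\}$ to denote the set of its neighbors. The degree of a node $v$ is denoted by $d(v) = |\mathcal{N}(v)|$, and the maximum degree of the graph by $\Delta(G)$. For a subset of nodes $U \subseteq V$, we use $G[U] = (U, E[U])$ to represent the subgraph induced by the nodes in $U$, where $E[U] = \{(u, v) \in E : u, v \in U\}$. Given a graph $G$, its complement is denoted by $G' = (V, E')$, where $E'$ is the set of all the edges between nodes that are not connected in $G$. Consequently, if $|E'| = m'$, then $m + m' = n(n-1)/2$ represents the number of edges in the complete graph on $V$. The graph adjacency matrix of graph $G$ is denoted by $\mathbf{A}_G \in \{0,1\}^{n \times n}$. We use $\mathbf{I}$ to denote the identity matrix. The element-wise product of two matrices $\mathbf{A}$ and $\mathbf{B}$ is denoted by $\mathbf{A} \circ \mathbf{B}$. The trace of a matrix $\mathbf{A}$ is denoted by $\mathrm{tr}(\mathbf{A})$. We use $\mathrm{diag}(\mathbf{A})$ to denote the diagonal of $\mathbf{A}$. For any positive integer $n$, $[n] = \{1, \ldots, n\}$. The vector (resp. matrix) of all ones and size $n$ (resp. $n \times n$) is denoted by $\mathbf{e}_n$ (resp. $\mathbf{J}_n$). Furthermore, we use $\mathbb{1}(\cdot)$ to denote the indicator function that returns $1$ (resp. $0$) when its argument is True (resp. False).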
Problem Statement:
In this paper, we consider the NP-hard problem of finding a maximum independent set (MIS). Next, we formally define the MIS problem and the complementary Maximum Clique (MC) problem.
Definition 1 (MIS Problem).
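Given an undirected graph $G = (V, E)$, the goal of MIS is to find a subset of vertices $\mathcal{I} \subseteq V$ such that $(u, v) \notin E$ for all $u, v \in \mathcal{I}$, and $|\mathcal{I}|$ is maximized.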
Definition 2 (MC Problem).
Given an undirected graph $G = (V, E)$, the goal of MC is to find a subset of vertices $\mathcal{C} \subseteq V$ such that the subgraph of $G$ induced by $\mathcal{C}$ is a complete graph, and $|\mathcal{C}|$ is maximized.
The two problems are complementary: an MIS of a graph $G$ is an MC of the complement graph $G'$ [1]. Let each entry of a binary vector $\mathbf{z} \in \{0,1\}^n$ correspond to a node $v \in V$ and be denoted by $z_v$. An integer linear program (ILP) for MIS can be formulated as follows [16]:
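$$\max_{\mathbf{z} \in \{0,1\}^n} \; \sum_{v \in V} z_v \quad \text{s.t.} \quad z_v + z_u \le 1, \;\; \forall (v, u) \in E. \tag{1}$$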
The following quadratic integer program (QIP) in (2) (with an optimal solution that is equivalent to the optimal solution of the above ILP) can also be used to formulate the MIS problem [17]:
$$\max_{\mathbf{z} \in \{0,1\}^n} \; \sum_{v \in V} z_v - \sum_{(v, u) \in E} z_v z_u. \tag{2}$$
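As a sanity check on the claimed equivalence, the following minimal Python sketch brute-forces both formulations on a toy graph. It assumes that the ILP maximizes the number of selected nodes subject to one constraint per edge, and that the QIP maximizes the number of selected nodes minus the number of edges whose endpoints are both selected; the helper names and the 5-cycle instance are illustrative and not taken from the paper.

```python
from itertools import product

def ilp_optimum(n, edges):
    """Best objective of the edge-constrained ILP, found by exhaustive search."""
    best = 0
    for z in product((0, 1), repeat=n):
        # Feasible only if no edge has both endpoints selected.
        if all(z[u] + z[v] <= 1 for u, v in edges):
            best = max(best, sum(z))
    return best

def qip_optimum(n, edges):
    """Best objective of the unconstrained quadratic integer program."""
    return max(
        sum(z) - sum(z[u] * z[v] for u, v in edges)
        for z in product((0, 1), repeat=n)
    )

# Toy instance (assumed for illustration): a 5-cycle, whose MIS has size 2.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
ilp_value = ilp_optimum(5, cycle_edges)
qip_value = qip_optimum(5, cycle_edges)
```

Exhaustive search is only viable for very small graphs; the point is that the quadratic penalty removes the explicit edge constraints without changing the optimal value.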
Furthermore, the work in [18] introduced the following semidefinite programming (SDP) relaxation of the MIS problem.
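$$\max_{\mathbf{X} \in \mathbb{R}^{n \times n}} \; \sum_{v \in V} \sum_{u \in V} X_{vu} \quad \text{s.t.} \quad X_{vu} = 0, \;\; \forall (v, u) \in E, \quad \mathrm{tr}(\mathbf{X}) = 1, \quad \mathbf{X} \succeq 0, \tag{3}$$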
where the third constraint denotes the positive semi-definiteness of the optimization matrix $\mathbf{X}$. We denote the optimal solution of (3) by $\mathbf{X}^*$. The diagonal entries of $\mathbf{X}^*$ represent the ‘weight’ of each vertex in being part of the MIS, whereas the off-diagonal non-zero elements, i.e., $X^*_{vu}$ for indices $(v, u) \notin E$, indicate how likely nodes $v$ and $u$ are to be in the MIS. While Problem (3) is convex (so any local optimum is also a global optimum) and can be solved in polynomial time, obtaining an MIS from $\mathbf{X}^*$ requires rounding techniques (such as spectral clustering [19]) that, in most cases, do not yield an optimal MIS of the graph.
2.2 Related work
1) Exact and Heuristic Solvers: Exact approaches for NP-hard problems typically rely on branch-and-bound global optimization techniques. However, exact approaches suffer from poor scalability, which limits their use in large MIS problems [20]. This limitation has spurred the development of efficient approximation algorithms and heuristics. For instance, the well-known NetworkX library [21] implements a heuristic procedure for solving the MIS problem [6]. These polynomial-time heuristics often incorporate a mix of sub-procedures, including greedy algorithms, local search sub-routines, and genetic algorithms [22]. However, such heuristics generally cannot theoretically guarantee that the resulting solution is within a small factor of optimality. In fact, inapproximability results have been established for the MIS problem [23].
Among existing MIS heuristics, ReduMIS [3] has emerged as the leading approach. The ReduMIS framework contains two primary components: (i) an iterative application of various graph reduction techniques (e.g., the linear programming (LP) reduction method in [16]) with a stopping rule based on the non-applicability of these techniques; and (ii) an evolutionary algorithm. The ReduMIS algorithm starts with a pool of independent sets and evolves them through multiple rounds. In each round, a selection procedure identifies favorable nodes by executing graph partitioning, which clusters the graph nodes into disjoint clusters and separators to enhance the solution. In contrast, our dQNN approach does not require such complex algorithmic operations (e.g., solution combination operation, community detection, and local search algorithms for solution improvement) as used in ReduMIS. More importantly, the run-time of ReduMIS and ILP solvers scales with both the number of nodes and the number of edges (which constrains their application to highly dense graphs), whereas the run-time of dQNN scales only w.r.t. the number of nodes, as will be demonstrated in our experimental results.
2) Data-Driven Learning-Based Solvers: As mentioned in Section 1, data-driven approaches for MIS problems can be classified into SL and RL methods. A notable SL method is proposed in [24], which combines several components including graph reductions [3], Graph Convolutional Networks (GCN) [25], guided tree search, and a solution improvement local search algorithm [26]. The GCN is trained on benchmark graphs using their solutions as ground truth labels, enabling the learning of probability maps for the inclusion of each vertex in the optimal solution. Then, a subset of ReduMIS subroutines is used to improve their solution. While the work in [24] reported on-par results to ReduMIS, it was later shown by [12] that replacing the GCN output with random values performs similarly to using the trained GCN network. Recently, DIFUSCO was introduced in [14], an approach that integrates Graph Neural Networks (GNNs) with diffusion models [27] to create a graph-based diffusion denoiser. DIFUSCO formulates the MIS problem in the discrete domain and trains a diffusion model to improve a single or a pool of solutions.
On the other hand, RL-based methods have achieved more success in solving the MIS problem when compared to SL methods. In the work of [28], a Deep Q-Network (DQN) is combined with graph embeddings, facilitating the discrimination of vertices based on their influence on the solution and ensuring scalability to larger instances. Meanwhile, the study presented in [29] introduces the Learning What to Defer (LwD) method, an unsupervised deep RL solver resembling tree search, where vertices are iteratively assigned to the independent set. Their model is trained using Proximal Policy Optimization (PPO) [30]. The work in [31] introduces DIMES, which combines a compact continuous space to parameterize the distribution of potential solutions and a meta-learning framework to facilitate the effective initialization of model parameters during the fine-tuning stage that is required for each graph.
It is worth noting that the majority of SL and RL methods are data-dependent in the sense that they require the training of a separate network for each dataset of graphs. These data-dependent methods exhibit limited generalization performance when applied to out-of-distribution graph data. This weak generalization stems from the need to train a different network for each graph dataset. In contrast, our dQNN approach differs from SL- and RL-based methods in that it does not rely on any training datasets. Instead, our dQNN approach utilizes a simple yet effective graph-encoded continuous objective function, which is defined solely in terms of the connectivity of a given graph.
3) Dataless Differentiable Methods: The most related work to ours is [32], which introduces dataless neural networks (dNNs) tailored for the MIS problem. Notably, their method operates without the need for training data and relies on $n$ trainable parameters. Their proposed methodology advocates using a ReLU-based continuous objective to solve the MIS problem. However, to scale up and improve their method, graph partitioning and local search algorithms were employed.
4) Discrete Sampling Solvers: In recent studies, researchers have explored the integration of energy-based models with parallel implementations of simulated annealing to address combinatorial optimization problems [33] without relying on any training data. For example, in tackling the MIS problem, Sun et al. [34] proposed a solver that combines (i) Path Auxiliary Sampling [35] and (ii) the binary quadratic integer program in (2). However, unlike dQNN, these approaches entail prolonged sequential run-time and require fine-tuning of multiple hyperparameters. Moreover, the energy models utilized in this method for addressing the MIS problem may generate binary vectors that violate the “no edges” constraint inherent to the MIS problem. Consequently, a post-processing procedure becomes necessary.
3 dQNN: Dataless quadratic neural networks for the MIS problem
In this section, we introduce our dataless neural network, dQNN, designed to solve the MIS problem through neural training without the need for any training data.
3.1 dQNN: The model
We will first present the model of our proposed dQNN that is (i) differentiable everywhere w.r.t. $\mathbf{x}$, and (ii) free of hyper-parameter scheduling. A continuous relaxation of QIP (2) is
$$\min_{\mathbf{x} \in [0,1]^n} \; -\sum_{v \in V} x_v + \sum_{(v, u) \in E} x_v x_u. \tag{4}$$
Let $\mathbf{x}^*$ denote a binary minimizer of Problem (4). Then, it was shown in [36] that $\mathbf{x}^*$ corresponds to an MIS of size $\sum_{v \in V} x^*_v$. Based on the quadratic MIS formulation in Problem (4), our proposed dQNN introduces several improvements and modifications to efficiently solve the MIS problem. In particular, dQNN incorporates an edges-penalty parameter $\gamma$, which scales the influence of the edges of the graph on the optimization objective. Furthermore, dQNN uses the adjacency matrices of both $G$ and its complement $G'$. To see how dQNN is designed, we first consider the following $\gamma$-parameterized augmented quadratic formulation for the MIS problem:
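$$\min_{\mathbf{x} \in [0,1]^n} \; f(\mathbf{x}) := -\sum_{v \in V} x_v + \gamma \sum_{(v, u) \in E} x_v x_u - \sum_{(v, u) \in E'} x_v x_u, \tag{5}$$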
where $\gamma$ is the edges-penalty parameter. The rationale behind the third augmented term in Problem (5) (corresponding to the edges of the complement graph $G'$) is to encourage the optimizer to select two nodes with no edge connecting them in $G$ (implying an edge in $G'$). We will theoretically show later in Theorem 4 that any MIS minimizer is a local minimizer of Problem (5) with an appropriately chosen $\gamma$-value. Interestingly, the above continuous quadratic formulation of the MIS problem in Problem (5) admits a dataless implementation of the quadratic neural network (QNN), which was recently introduced in [37].
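To make the role of the edges-penalty parameter concrete, the following worked derivation sketches the first and second derivatives of the objective. It assumes that the objective of Problem (5) consists of a linear node term, a $\gamma$-weighted sum over the edges of $G$, and a negated sum over the edges of $G'$, as the surrounding text describes; the symbols $\mathcal{N}(v)$ and $\mathcal{N}'(v)$ for the neighbors of $v$ in $G$ and $G'$ are notation assumed here.

```latex
f(\mathbf{x}) = -\sum_{v \in V} x_v
  + \gamma \sum_{(v,u) \in E} x_v x_u
  - \sum_{(v,u) \in E'} x_v x_u

\frac{\partial f}{\partial x_v}
  = -1 + \gamma \sum_{u \in \mathcal{N}(v)} x_u
       - \sum_{u \in \mathcal{N}'(v)} x_u

\frac{\partial^2 f}{\partial x_v \, \partial x_u} =
\begin{cases}
  \gamma & \text{if } (v,u) \in E, \\
  -1     & \text{if } (v,u) \in E', \\
  0      & \text{if } u = v.
\end{cases}
```

Under this assumed form, every selected neighbor of $v$ raises the partial derivative by $\gamma$ (pushing $x_v$ toward 0), while every selected non-neighbor lowers it by one (pushing $x_v$ toward 1). The Hessian has a zero diagonal and therefore zero trace, so for $n \ge 2$ and $\gamma \neq 0$ it has both positive and negative eigenvalues, which is consistent with the objective being non-convex.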
To see how Problem (5) corresponds to a dQNN, consider the graph example in Figure 1 (left), for which the dQNN is illustrated in Figure 1 (right). Here, the dQNN comprises two fully connected layers. The initial activation-free layer encodes information about the nodes (top connections), edges of $G$ (middle connections), and edges of $G'$ (bottom connections), all without a bias vector. The subsequent fully connected layer is an activation-free layer performing a vector dot-product between the fixed weight vector (with $-1$ corresponding to the nodes and the edges of $G'$, and the edges-penalty parameter $\gamma$ corresponding to the edges of $G$) and the output of the first layer.
Utilizing the SDP solution $\mathbf{X}^*$ of Problem (3), along with its interpretation discussed in Section 2.1, dQNN has the flexibility of incorporating $\mathbf{X}^*$ into the objective in Problem (5) as follows:
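The two-layer view described above can be sketched in a few lines of plain Python (used here instead of PyTorch to keep the example dependency-free). This is a minimal illustration under stated assumptions: the first layer is taken to output one unit per node and one product unit per edge of the graph and of its complement, and the output weights are taken to be -1, gamma, and -1 for those three groups. The function names are hypothetical, and how the paper's network realizes the product units internally is not specified here.

```python
def complement_edges(n, edges):
    """Edges of the complement graph on nodes 0..n-1."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if frozenset((u, v)) not in present]

def first_layer(x, edges, comp_edges):
    """Bias-free first layer: node units, edge units, complement-edge units."""
    node_units = list(x)
    edge_units = [x[u] * x[v] for u, v in edges]
    comp_units = [x[u] * x[v] for u, v in comp_edges]
    return node_units + edge_units + comp_units

def output_weights(n, edges, comp_edges, gamma):
    """Fixed second-layer weights (assumed): -1 for nodes, gamma for edges, -1 for complement edges."""
    return [-1.0] * n + [gamma] * len(edges) + [-1.0] * len(comp_edges)

def dqnn_forward(x, edges, gamma):
    """Network output, i.e. the objective value at the trainable vector x."""
    n = len(x)
    comp = complement_edges(n, edges)
    hidden = first_layer(x, edges, comp)
    weights = output_weights(n, edges, comp, gamma)
    return sum(w * h for w, h in zip(weights, hidden))

# Toy example (assumed): the path 0 - 1 - 2, whose MIS is {0, 2}.
path_edges = [(0, 1), (1, 2)]
value_mis = dqnn_forward([1.0, 0.0, 1.0], path_edges, gamma=2.0)
value_bad = dqnn_forward([1.0, 1.0, 0.0], path_edges, gamma=2.0)
```

In this sketch the only trainable quantity is the vector `x`; the graph fixes both the layer structure and the output weights, so evaluating the network on the MIS vector gives a lower value than evaluating it on a vector that selects two adjacent nodes.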
(6) |
and represent the likelihood of nodes in the graph and complement graph , respectively, to be included in the MIS. This serves as our rationale behind the objective function in Problem (6).
Remark 1.
Despite the polynomial-time complexity of the SDP formulation in (3) and the availability of efficient solvers such as MOSEK [38], efficiently obtaining the optimal solution in (3) is predominantly achievable for sparse graphs. This constraint emerges from the fact that the computational complexity of SDP grows proportionally with both the number of nodes and the number of edges in the graph. Consequently, the practical applicability of utilizing the objective function in Problem (6) is limited to sparse graphs.
3.2 dQNN: The training algorithm
Drawing from the objective function of Problem (5) and the network structure of dQNN, we introduce an MIS training algorithm. Notably, in contrast to the ReLU-based dNN proposed in [32], the dQNN formulation in Problem (5) is fully differentiable over $[0,1]^n$, enabling more numerically stable optimization [39], as demonstrated in our experimental findings discussed in Section 4.
Our objective functions in Problems (5) and (6) are highly non-convex, which makes finding the global minimizer(s) a challenging task. The work in [40] details the complexity of box-constrained continuous quadratic optimization problems. Gradient-based optimizers like Adam [15] are effective for finding a local minimizer given an initialization in $[0,1]^n$. Due to the full differentiability of our objective (Problems (5) and (6)), Adam empirically proves to be computationally highly efficient. Consequently, for a single graph, we can initiate multiple optimizations from various points in $[0,1]^n$ and execute Adam in parallel for each. Specifically, with a specified number of batches (parallel processes) $B$, we define set $\mathcal{S}$ to denote all the initializations, where $|\mathcal{S}| = B$, and consider the following three approaches:
- Random Initialization: Here, each vector in set $\mathcal{S}$ is obtained by sampling each entry independently from the uniform distribution on $[0,1]$. This strategy is effective when all vertices have similar degrees, as in Erdős-Rényi (ER) [41] graphs.
- SDP-based Initialization: Given that the diagonal of $\mathbf{X}^*$ represents the ‘weight’ of each vertex to be in the MIS, we propose using the SDP solution, by which set $\mathcal{S}$ consists of the vector $\mathrm{diag}(\mathbf{X}^*)$ and $B - 1$ samples drawn from a Gaussian distribution with mean vector $\mathrm{diag}(\mathbf{X}^*)$ and covariance matrix $\eta \mathbf{I}$. Here, $\eta$ serves as a hyper-parameter that governs the exploration around $\mathrm{diag}(\mathbf{X}^*)$. In this case, we use Problem (6). This strategy is particularly effective in scenarios where solving the SDP is computationally tractable, such as in the case of sparse graphs.
- Degree-based Initialization: Following the intuition that vertices with higher degrees are less likely to belong to an MIS than those with lower degrees, we propose using set $\mathcal{S}$ with $B$ samples drawn from a Gaussian distribution with a mean vector computed from the node degrees and covariance matrix $\eta \mathbf{I}$. This will be our choice when computing the SDP solution is computationally expensive, i.e., for dense graphs.
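The random and degree-based schemes above can be sketched as follows. This is an illustration under explicit assumptions rather than the paper's implementation: the degree-based mean is taken to be one minus the degree divided by the maximum degree (a hypothetical choice that merely gives lower-degree nodes larger starting values), the covariance is taken to be isotropic with variance `eta`, and all function and parameter names are made up for this example. The SDP-based scheme is omitted because it needs an SDP solver.

```python
import random

def random_init(n, batch, rng):
    """Each entry drawn independently and uniformly from [0, 1)."""
    return [[rng.random() for _ in range(n)] for _ in range(batch)]

def degree_based_init(n, edges, batch, eta, rng):
    """Gaussian samples around a degree-derived mean (assumed formula)."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    max_degree = max(degree) or 1          # avoid division by zero on edgeless graphs
    # ASSUMPTION: mean_v = 1 - d(v) / max degree; the paper's exact formula may differ.
    mean = [1.0 - d / max_degree for d in degree]
    std = eta ** 0.5                       # isotropic covariance eta * I (assumed)
    return [[rng.gauss(mu, std) for mu in mean] for _ in range(batch)]

# Toy example (assumed): a star with center 0 and leaves 1, 2.
rng = random.Random(0)
star_edges = [(0, 1), (0, 2)]
uniform_starts = random_init(3, batch=4, rng=rng)
degree_starts = degree_based_init(3, star_edges, batch=4, eta=0.01, rng=rng)
```

Gaussian samples can fall outside the unit box; in this sketch that is left to the projection step of the training loop rather than handled at initialization time.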
We outline the MIS training procedure for dQNN in Algorithm 1. As shown, the algorithm takes a graph $G$, a set of initializations $\mathcal{S}$, the maximum number of iterations per batch $T$ (with iteration index $t$), an edge-penalty parameter $\gamma$, and an Adam learning rate $\alpha$ as inputs. For each batch and iteration $t$, the Adam optimizer updates $\mathbf{x}$ (Line 4). In Line 5, a projection onto $[0,1]^n$ is employed. In Lines 6 and 7, the algorithm checks whether the updated $\mathbf{x}$ corresponds to an MIS in the graph. If so, the algorithm stops for this batch. Finally, the best-found MIS, determined by its cardinality, is returned in Line 8. The blue font is used to indicate the case when we use the SDP initialization.
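The loop described above can be sketched in plain Python with a hand-written Adam update. This is a minimal illustration, not the authors' PyTorch code: it assumes the objective has a linear node term, a gamma-weighted edge term, and a negated complement-edge term; it assumes entries above 0.5 are rounded to 1; and it assumes the stopping test checks for a maximal independent set (maximum cardinality cannot be verified cheaply). The batches run sequentially here, whereas the paper runs them in parallel on a GPU. All names and hyper-parameter values are hypothetical.

```python
import math
import random

def objective_gradient(x, edges, comp_edges, gamma):
    """Gradient of -sum(x) + gamma * sum_E x_u x_v - sum_E' x_u x_v."""
    g = [-1.0] * len(x)
    for u, v in edges:
        g[u] += gamma * x[v]
        g[v] += gamma * x[u]
    for u, v in comp_edges:
        g[u] -= x[v]
        g[v] -= x[u]
    return g

def is_maximal_independent(selected, adj):
    """True if no two selected nodes are adjacent and no further node can be added."""
    if any(adj[u] & selected for u in selected):
        return False
    return all(adj[v] & selected for v in range(len(adj)) if v not in selected)

def train_dqnn(n, edges, inits, gamma, lr, max_iters,
               beta1=0.9, beta2=0.999, eps=1e-8):
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    comp_edges = [(u, v) for u in range(n) for v in range(u + 1, n)
                  if v not in adj[u]]
    best = set()
    for x0 in inits:                                  # one batch per initialization
        x = [min(1.0, max(0.0, xi)) for xi in x0]     # start inside the unit box
        m = [0.0] * n                                 # Adam first-moment estimate
        s = [0.0] * n                                 # Adam second-moment estimate
        for t in range(1, max_iters + 1):
            g = objective_gradient(x, edges, comp_edges, gamma)
            for i in range(n):
                m[i] = beta1 * m[i] + (1 - beta1) * g[i]
                s[i] = beta2 * s[i] + (1 - beta2) * g[i] ** 2
                m_hat = m[i] / (1 - beta1 ** t)       # bias-corrected moments
                s_hat = s[i] / (1 - beta2 ** t)
                step = lr * m_hat / (math.sqrt(s_hat) + eps)
                x[i] = min(1.0, max(0.0, x[i] - step))  # Adam step, then projection
            selected = {i for i in range(n) if x[i] > 0.5}  # assumed rounding rule
            if is_maximal_independent(selected, adj):
                if len(selected) > len(best):
                    best = selected
                break                                 # stop this batch
    return best

# Toy example (assumed): the path 0 - 1 - 2 - 3, whose MIS has size 2.
rng = random.Random(0)
path_edges = [(0, 1), (1, 2), (2, 3)]
starts = [[rng.random() for _ in range(4)] for _ in range(8)]
found = train_dqnn(4, path_edges, starts, gamma=4.0, lr=0.1, max_iters=200)
```

Because a batch only contributes when its rounded vector passes the independence check, whatever the function returns is always a valid independent set, even if some batches fail to converge within the iteration budget.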
3.3 dQNN: Theoretical foundation
Here, we provide the necessary and sufficient condition on the $\gamma$-value for any MIS to correspond to a local minimizer of Problem (5). Moreover, we also provide a sufficient condition for all local minimizers of Problem (5) to be associated with an MIS. We relegate the proofs to Appendix A.
Definition 3 (MIS vector).
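Given a graph $G = (V, E)$, a binary vector $\mathbf{x} \in \{0,1\}^n$ is called an MIS vector if there exists an MIS $\mathcal{I}$ of $G$ such that $x_v = 1$ for all $v \in \mathcal{I}$, and $x_v = 0$ for all $v \notin \mathcal{I}$.
Theorem 4 (Necessary and Sufficient Condition on $\gamma$ for MIS vectors to be local minimizers of Problem (5)).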
Remark 2.
Next, we provide a stronger condition on $\gamma$ that ensures all local minimizers of Problem (5) correspond to an MIS.
Theorem 5 (dQNN Local Minimizers).
Given graph and set , all local minimizers of (5) are MIS vectors of .
Remark 3.
The assumption in Theorem 5 is stronger than that in Theorem 4 (as for any graph with (non-empty graph), we have ). The trade-off of choosing a large -value is that while larger values of () ensure that only MIS are local minimizers, they also increase the non-convexity of the optimization problem, thereby making it more difficult to solve.
Remark 4.
Although the proposed constrained quadratic Problem (5) is still NP-hard to solve for the global minimizer, it is a relaxation of the original integer programming problem. It can leverage gradient information, allowing the use of high-performance computational resources and parallel processing to enhance the efficiency and scalability of our approach.
4 Experimental results
1) Settings, Baselines, & Benchmarks: Graphs are processed using the NetworkX library [21]. For baselines, we utilize Gurobi [9] and the recent Google solver CP-SAT [10] for the ILP in (1), ReduMIS [3], iSCO (https://github.com/google-research/discs) [34], and four learning-based methods: DIMES [31], DIFUSCO [14], LwD [29], and the GCN method in [24] (commonly referred to as ‘Intel’). We note that, following the analysis in [12], GCN’s code cloning to ReduMIS is disabled, which was also done in [14]. Aligned with recent SOTA methods (DIMES, DIFUSCO, and iSCO), we employ the Erdős-Rényi (ER) [41] graphs from [31] and the SATLIB graphs [43] as benchmarks. The ER dataset (https://github.com/DIMESTeam/DIMES) consists of graphs with to nodes and , where is the probability of edge creation. The SATLIB dataset consists of graphs (with at most nodes and edges). Additionally, the GNM random graph generator function of NetworkX is utilized for our scalability experiment. For dQNN, the edges-penalty parameter is selected as . The initial learning rate is for ER, 0.9 for SATLIB, and 0.5 for GNM. The number of iterations per initialization, , is set to for ER, for SATLIB, and 350 for GNM. The exploration parameter for SDP-based and Degree-based initialization is set to . Our code (https://anonymous.4open.science/r/Quant-Net/README.md) uses PyTorch [44] to construct the objective function in dQNN, and PyTorch’s implementation of Adam to optimize. Further implementation details and results are provided in Appendix B and Appendix D, respectively.
2) ER and SATLIB Benchmark Results:
Here, we present the results of dQNN, along with the other considered baselines, on the SATLIB (Table 1(a)) and ER (Table 1(b)) benchmarks in terms of the average MIS size over the graphs in the dataset and the total sequential run-time required to obtain the results for all the graphs. We note that the results of the learning-based methods are sourced from [14]. In what follows, we provide observations on these results.
- All learning-based methods, except for GCN, require training a separate network for each graph dataset, as indicated in the third column of Table 1(a) and Table 1(b). This illustrates the generalization limitations of these methods. In contrast, our method is more generalizable, as it only requires tuning a few hyper-parameters for each set of graphs.
- When compared to learning-based approaches, on ER (resp. SATLIB), our method outperforms all (resp. most) baseline methods, all without requiring any training data. Run-time comparison with these methods is not considered, as the reported numbers exclude training time, which may vary depending on multiple factors such as graph size, available compute, number of data points, and the neural network architecture used. Furthermore, our approach does not rely on additional techniques such as Greedy Decoding [45] and Monte Carlo Tree Search [46].
- When compared to iSCO, our method reports a similar MIS size for SATLIB (423.22 vs. 423.7), while falling short by about 1.3 nodes on ER (43.52 vs. 44.8). Nevertheless, our method requires significantly less sequential run-time. It is important to note that the iSCO paper [34] reports a lower run-time than other methods. This reported run-time is achieved by evaluating the test graphs in parallel, in contrast to all other methods, which evaluated them sequentially. To compare methods fairly in our experiments, we opted to report sequential test run-time only. The extended sequential run-time of iSCO, compared to its parallel run-time, is due to its use of simulated annealing. Because simulated annealing depends on knowing the energy of the previous step when determining the next step, it is inherently more efficient for iSCO to solve many graphs in parallel than in series.
- For SATLIB, which consists of highly sparse graphs, on average, dQNN falls short by a few nodes when compared to ReduMIS, Gurobi, and CP-SAT. The reason ReduMIS achieves SOTA results here is that a large set of MIS-specific graph reductions can be applied. However, for denser graphs, most of these graph reductions are not applicable. Gurobi and CP-SAT solve the ILP in (1), in which the number of constraints equals the number of edges in the graph. This means that Gurobi and CP-SAT are expected to perform much better on sparse graphs such as those in SATLIB.
- On ER, when compared to Gurobi and CP-SAT, our method not only reports a larger average MIS size but also requires less than half the run-time. This is because the ER graphs are denser than the SATLIB graphs. As a result, when run for 64 minutes on ER, Gurobi and CP-SAT fall short compared to our method and ReduMIS, while reporting the same average MIS size as ReduMIS for SATLIB.
Table 1(a): SATLIB benchmark.
Method | Type | Training Data | MIS Size | Total Run-time (m) | Run-time Comment |
---|---|---|---|---|---|
ReduMIS | Heuristics | Not Required | 425.96 | 37.58 | Run until completion |
CP-SAT | Exact | Not Required | 425.96 | 0.78 | Run until completion |
Gurobi | Exact | Not Required | 425.96 | 8.16 | Run until completion |
GCN | SL+G | SATLIB | 420.66 | 23.05 | This excludes training time |
LwD | RL+S | SATLIB | 422.22 | 18.83 | This excludes training time |
DIMES | RL+TS | SATLIB | 422.22 | 18.83 | This excludes training time |
DIMES | RL+S | SATLIB | 423.28 | 20.26 | This excludes training time |
DIFUSCO | SL+G | SATLIB | 424.5 | 8.76 | This excludes training time |
DIFUSCO | SL+S | SATLIB | 425.13 | 23.74 | This excludes training time |
iSCO | Sampling | Not Required | 423.7 | 7500 | Sequential runtime; original paper shows parallelized runtime |
dQNN + SDP (Ours) | dQNN | Not Required | 423.22 | 89.69 | This excludes SDP time (30 seconds per graph) |
dQNN + DI (Ours) | dQNN | Not Required | 423.03 | 64.9 | Run until completion |
Table 1(b): ER benchmark.
Method | Type | Training Data | MIS Size | Total Run-time (m) | Run-time Comment |
---|---|---|---|---|---|
ReduMIS | Heuristics | Not Required | 44.87 | 52.13 | Run until completion |
CP-SAT | Exact | Not Required | 41.09 | 64.00 | Run with 30 second time limit per graph |
Gurobi | Exact | Not Required | 39.19 | 64.00 | Run with 30 second time limit per graph |
GCN | SL+G | SATLIB | 34.86 | 6.06 | This excludes training time |
GCN | SL+TS | SATLIB | 38.8 | 20 | This excludes training time |
LwD | RL+S | ER | 41.17 | 6.33 | This excludes training time |
DIMES | RL+TS | ER | 38.24 | 6.12 | This excludes training time |
DIMES | RL+S | ER | 42.06 | 12.01 | This excludes training time |
DIFUSCO | SL+G | ER | 38.83 | 8.8 | This excludes training time |
DIFUSCO | SL+S | ER | 41.12 | 26.27 | This excludes training time |
iSCO | Sampling | Not Required | 44.8 | 384 | Sequential runtime; original paper shows parallelized runtime |
dQNN + RI (Ours) | dQNN | Not Required | 43.52 | 21 | Run until completion |
3) Scalability Results:
It is well-established that relatively denser graphs pose greater computational challenges compared to sparse graphs. The other baselines reflect this trend, as they predominantly excel on sparse graphs. We argue that this is due to the applicability of graph reduction techniques such as the LP reduction method in [16] and the unconfined vertices rule [47] (see [3] for a complete list of the graph reduction rules, which apply only to sparse graphs). For instance, by simply applying the LP graph reduction technique, the large-scale highly sparse graphs (with several hundred thousand nodes) considered in Table 5 of [24] reduce to graphs of a few thousand nodes, often with disconnected sub-graphs that can be treated independently.
Therefore, the scalability and performance of ReduMIS depend significantly on the sparsity of the graph. This dependence emerges from the iterative application of various graph reduction techniques in ReduMIS, specifically tailored for sparse graphs. For instance, the ReduMIS results presented in Table 2 of [29] are exclusively based on large and highly sparse graphs. This conclusion is substantiated by both the sizes of the considered graphs and the corresponding sizes of the obtained MIS solutions. As such, in this experiment, we investigate the scalability of dQNN for the MIS problem against the SOTA data-independent methods: ReduMIS, Gurobi, and CP-SAT. Here, we use graphs randomly generated with the GNM generator, in which the number of edges is set to $m = \lfloor n(n-1)/4 \rfloor$. It is important to note that the density of these graphs is significantly higher than that of the graphs considered in previous works. This choice of the number of edges in the GNM function means that half of the total possible edges (w.r.t. the complete graph) exist.
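The dense random graphs described above can be generated with a short standard-library sketch. The paper uses NetworkX's GNM generator (`networkx.gnm_random_graph(n, m, seed=...)`); the dependency-free stand-in below samples the same kind of graph by drawing, uniformly without replacement, half of all possible node pairs. The function name and the choice of 50 nodes are assumptions made for illustration.

```python
import random
from itertools import combinations

def half_dense_gnm(n, seed=None):
    """Sample a graph on n nodes containing half of all possible edges."""
    rng = random.Random(seed)
    all_pairs = list(combinations(range(n), 2))   # n(n-1)/2 candidate edges
    m = len(all_pairs) // 2                       # half of them, rounded down
    return rng.sample(all_pairs, m)               # uniform, without replacement

# Toy example (assumed size): 50 nodes give 1225 possible edges, of which 612 are kept.
dense_edges = half_dense_gnm(50, seed=0)
```

At this density the number of edges grows quadratically with the number of nodes, which is the regime in which solvers whose cost depends on the edge count are put under the most pressure.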
Results are provided in Figure 2. As observed, for dense graphs, as the graph size increases, our method requires significantly less run-time (Figure 2(a)) compared to all baselines, while reporting almost the same average MIS size (Table 2(b)). For instance, when is , on average, our method requires around 1 minute to solve the 5 graphs, whereas other baselines require approximately 45 minutes or more to achieve the same MIS size. These results indicate that, unlike ReduMIS and ILP solvers, the run-time of our method scales only with the number of nodes in the graph, which is a significant improvement.
ReduMIS | Gurobi | CP-SAT | - (Ours) | |
---|---|---|---|---|
N/A | N/A | |||
N/A | N/A | |||
N/A | N/A |
5 Conclusion
This study addressed the challenging Maximum Independent Set (MIS) Problem within the domain of Combinatorial Optimization by introducing an innovative continuous formulation employing dataless quadratic neural networks. By eliminating the need for any training data, - sets itself apart from conventional learning approaches. Through the utilization of gradient-based optimization using ADAM and a GPU implementation, our straightforward yet effective approach demonstrates competitive performance compared to state-of-the-art learning-based and sampling-based methods. This research offers a distinctive perspective on approaching discrete optimization problems through parameter-efficient neural networks that are trained from the problem structure, not from datasets.
References
- Karp [1972] Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.
- Bengio et al. [2021] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of Operational Research, 290(2):405–421, 2021.
- Lamm et al. [2016] Sebastian Lamm, Peter Sanders, Christian Schulz, Darren Strash, and Renato F Werneck. Finding near-optimal independent sets at scale. In 2016 Proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 138–150. SIAM, 2016.
- Akiba and Iwata [2016] Takuya Akiba and Yoichi Iwata. Branch-and-reduce exponential/fpt algorithms in practice: A case study of vertex cover. Theoretical Computer Science, 609:211–225, 2016.
- San Segundo et al. [2011] Pablo San Segundo, Diego Rodríguez-Losada, and Agustín Jiménez. An exact bit-parallel algorithm for the maximum clique problem. Computers & Operations Research, 38(2):571–581, 2011.
- Boppana and Halldórsson [1992] Ravi Boppana and Magnús M Halldórsson. Approximating maximum independent sets by excluding subgraphs. BIT Numerical Mathematics, 32(2):180–196, 1992.
- Tarjan and Trojanowski [1977] Robert Endre Tarjan and Anthony E Trojanowski. Finding a maximum independent set. SIAM Journal on Computing, 6(3):537–546, 1977.
- [8] IBM. IBM ILOG CPLEX Optimization Studio. URL https://www.ibm.com/products/ilog-cplex-optimization-studio.
- [9] Gurobi. Gurobi Optimization. URL https://www.gurobi.com.
- Google, Inc. [2022] Google, Inc. Google or-tools. 2022. URL https://developers.google.com/optimization.
- He et al. [2014] He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. Advances in neural information processing systems, 27:3293–3301, 2014.
- Böther et al. [2022] Maximilian Böther, Otto Kißig, Martin Taraz, Sarel Cohen, Karen Seidel, and Tobias Friedrich. What’s wrong with deep learning in tree search for combinatorial optimization. arXiv preprint arXiv:2201.10494, 2022.
- Dong et al. [2021] Yuanyuan Dong, Andrew V Goldberg, Alexander Noe, Nikos Parotsidis, Mauricio GC Resende, and Quico Spaen. New instances for maximum weight independent set from a vehicle routing application. In Operations Research Forum, volume 2, pages 1–6. Springer, 2021.
- Sun and Yang [2023] Zhiqing Sun and Yiming Yang. Difusco: Graph-based diffusion solvers for combinatorial optimization. arXiv preprint arXiv:2302.08224, 2023.
- Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
- Nemhauser and Trotter [1975] George L Nemhauser and Leslie Earl Trotter. Vertex packings: Structural properties and algorithms. Mathematical Programming, 8(1):232–248, 1975.
- Pardalos and Rodgers [1992] Panos M Pardalos and Gregory P Rodgers. A branch and bound algorithm for the maximum clique problem. Computers & operations research, 19(5):363–375, 1992.
- Lovász [1979] László Lovász. On the shannon capacity of a graph. IEEE Transactions on Information theory, 25(1):1–7, 1979.
- Von Luxburg [2007] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
- Dai et al. [2016] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International conference on machine learning, pages 2702–2711. PMLR, 2016.
- Hagberg et al. [2008] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using networkx. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA USA, 2008.
- Williamson and Shmoys [2011] David P Williamson and David B Shmoys. The design of approximation algorithms. Cambridge university press, 2011.
- Berman and Schnitger [1992] Piotr Berman and Georg Schnitger. On the complexity of approximating the independent set problem. Information and Computation, 96(1):77–94, 1992.
- Li et al. [2018] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In NeurIPS, 2018.
- Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29:3844–3852, 2016.
- Andrade et al. [2012] Diogo V Andrade, Mauricio GC Resende, and Renato F Werneck. Fast local search for the maximum independent set problem. Journal of Heuristics, 18(4):525–547, 2012.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Dai et al. [2017] Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6351–6361, 2017.
- Ahn et al. [2020] Sungsoo Ahn, Younggyo Seo, and Jinwoo Shin. Learning what to defer for maximum independent sets. In International Conference on Machine Learning, pages 134–144. PMLR, 2020.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Qiu et al. [2022] Ruizhong Qiu, Zhiqing Sun, and Yiming Yang. Dimes: A differentiable meta solver for combinatorial optimization problems. Advances in Neural Information Processing Systems, 35:25531–25546, 2022.
- Alkhouri et al. [2022] Ismail R Alkhouri, George K Atia, and Alvaro Velasquez. A differentiable approach to the maximum independent set problem using dataless neural networks. Neural Networks, 155:168–176, 2022.
- Goshvadi et al. [2024] Katayoon Goshvadi, Haoran Sun, Xingchao Liu, Azade Nova, Ruqi Zhang, Will Grathwohl, Dale Schuurmans, and Hanjun Dai. Discs: A benchmark for discrete sampling. Advances in Neural Information Processing Systems, 36, 2024.
- Sun et al. [2023] Haoran Sun, Katayoon Goshvadi, Azade Nova, Dale Schuurmans, and Hanjun Dai. Revisiting sampling for combinatorial optimization. In International Conference on Machine Learning, pages 32859–32874. PMLR, 2023.
- Sun et al. [2021] Haoran Sun, Hanjun Dai, Wei Xia, and Arun Ramamurthy. Path auxiliary proposal for mcmc in discrete space. In International Conference on Learning Representations, 2021.
- Mahdavi Pajouh et al. [2013] Foad Mahdavi Pajouh, Balabhaskar Balasundaram, and Oleg A Prokopyev. On characterization of maximal independent sets via quadratic optimization. Journal of Heuristics, 19:629–644, 2013.
- Fan et al. [2020] Fenglei Fan, Jinjun Xiong, and Ge Wang. Universal approximation with quadratic deep networks. Neural Networks, 124:383–392, 2020.
- [38] MOSEK ApS. MOSEK: Optimization software. https://www.mosek.com.
- Liu et al. [2021] Bo Liu, Zhaoying Liu, Ting Zhang, and Tongtong Yuan. Non-differentiable saddle points and sub-optimal local minima exist for deep relu networks. Neural Networks, 144:75–89, 2021.
- Burer and Letchford [2009] Samuel Burer and Adam N Letchford. On nonconvex quadratic programming with box constraints. SIAM Journal on Optimization, 20(2):1073–1089, 2009.
- Erdos et al. [1960] Paul Erdos, Alfréd Rényi, et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
- Wei [1981] Victor K Wei. A lower bound on the stability number of a simple graph. Technical report, Bell Laboratories Technical Memorandum Murray Hill, NJ, USA, 1981.
- Hoos and Stützle [2000] Holger H Hoos and Thomas Stützle. Satlib: An online resource for research on sat. Sat, 2000:283–292, 2000.
- [44] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch. https://pytorch.org/.
- Graikos et al. [2022] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. Advances in Neural Information Processing Systems, 35:14715–14728, 2022.
- Fu et al. [2021] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large tsp instances. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 7474–7482, 2021.
- Xiao and Nagamochi [2013] Mingyu Xiao and Hiroshi Nagamochi. Confining sets and avoiding bottleneck cases: A simple maximum independent set algorithm in degree-3 graphs. Theoretical Computer Science, 469:92–104, 2013.
Appendix A Proofs
Let’s begin by re-stating our main optimization problem:
(7) |
The gradient of (7) is:
(8) |
For some , we have
(9) |
A.1 Proof of Theorem 4
Proof.
Let be an MIS. Define the vector such that it contains ’s at positions corresponding to the nodes in the set , and ’s at all other positions. For any MIS to be a local minimizer of Problem (6), it is sufficient and necessary to require that
(10) | |||
(11) |
Here, is the element of at the position corresponding to the node . (10) is derived because if , then (by the definition of ) so it is at the left boundary of the interval . For the left boundary point to be a local minimizer, it requires the derivative to be non-negative (i.e., moving towards the right only increases the objective). Similarly, when , , is at the right boundary for (11), at which the derivative should be non-positive.
The derivative of computed in (9) can be rewritten as
(12) |
where is the number of neighbours of in and is the number of non-neighbours of in i.e., where . By this definition, we immediately have and , where the upper and lower bounds for and are all attainable by some special graphs. Note that the lower bound of is , which is due to the fact that is an MIS, so any other node (say ) has at least edge connected to a node in .
Plugging (12) into (10), we obtain
(13) |
Since we are seeking a universal for all graphs, we must set to its lowest possible value, , and to its highest possible value (both are attainable by some graphs), and still require to satisfy (13). This means it is necessary and sufficient to require . In addition, (11) is satisfied unconditionally and therefore does not impose any extra condition on . ∎
A.2 Proof of Theorem 5
Lemma 6.
All local minimizers of Problem (7) are binary vectors.
Proof.
Let be any local minimizer of (7). If all the coordinates of are either 0 or 1, then is binary and the proof is complete. Otherwise, at least one coordinate of is in the interior and we aim to prove by contradiction that this is not possible (i.e., such a non-binary cannot exist as a minimizer). We assume the non-binary exists, and denote the set of non-binary coordinates as
(14) |
Since is non-binary, . Since the objective function of (7) is twice differentiable with respect to all with , then a necessary condition for to be a local minimizer is that
where is the vector restricted to the index set , and is the matrix whose row and column indices are both restricted to the set .
However, the second necessary condition cannot hold: if it did, then we must have (the trace cannot equal 0 as ). On the other hand, we have
as the diagonal entries of and are all 0, which leads to a contradiction. Here denotes the identity matrix with row indices restricted to the index set . ∎
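The trace step rests on a standard linear-algebra fact, spelled out below. Here $M$ is a placeholder symbol (not from the original text) for the symmetric matrix appearing in the second-order condition, restricted to the non-binary index set, and $\lambda_i(M)$ denotes its eigenvalues.

```latex
% M: symmetric matrix with zero diagonal (placeholder for the restricted Hessian)
\[
M \succeq 0
\;\Longrightarrow\;
\lambda_i(M) \ge 0 \ \ \text{for all } i,
\qquad
\operatorname{tr}(M) \;=\; \sum_{i} M_{ii} \;=\; \sum_{i} \lambda_i(M) \;=\; 0
\;\Longrightarrow\;
\lambda_i(M) = 0 \ \ \text{for all } i
\;\Longrightarrow\;
M = 0 .
\]
```

Hence, whenever $M \neq 0$, a zero diagonal forces at least one negative eigenvalue, so the positive semidefiniteness required of a local minimizer fails.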
Theorem 7 (Re-statement of Theorem 5).
Given graph and set , all local minimizers of (5) correspond to an MIS in .
Proof.
By Lemma 6, we can only consider binary vectors as local minimizers. With this, we first prove that all local minimizers are Independent Sets (ISs). Then, we show that any IS that is not maximal is not a local minimizer.
• Here, we show that any local minimizer is an IS. By contradiction, assume that vector , by which such that (a binary vector with an edge in ), is a local minimizer. Since is at the right boundary of the interval , for it to be a local minimizer, we must have . Together with (9), this implies
(15) |
Re-arranging (15) and using yields
(16) |
Given that , the condition in (16) cannot be satisfied even if the LHS attains its minimum value (which is ) and the RHS attains its maximum value. The maximum possible value of the RHS is , where is the degree of node in , and the maximum possible value of is . This means that when an edge exists in , it cannot be a fixed point. Thus, only ISs are local minimizers.
• Here, we show that Independent Sets that are not maximal are not local minimizers. Define vector that corresponds to an IS that is not maximal. This means that there exists a node that is not in the IS and is not in the neighbor set of any node in the IS. Formally, if there exists such that , then is an IS, not a maximal IS. Note that such an satisfies and
(17) |
which implies that increasing can further decrease the function value, contradicting being a local minimizer. In (17), the second summation is as , which means that is always negative. Thus, any binary vector that corresponds to an IS that is not maximal is not a local minimizer.
∎
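The two cases of the proof, together with Theorem 4, can be checked exhaustively on a small graph. The sketch below assumes the objective in (7) has the form f(x) = -sum_v x_v + gamma * sum_{(u,v) in E} x_u x_v - sum_{(u,v) in E'} x_u x_v, with E' the edge set of the complement graph, and takes gamma = n. Both the form and the value are assumptions made for illustration, as are the toy graph and all function names.

```python
from itertools import combinations


def partial_derivative(v, chosen, adj, gamma):
    """Partial derivative w.r.t. x_v at the indicator vector of `chosen`,
    under the ASSUMED objective stated in the lead-in."""
    n_v = sum(1 for u in chosen if u in adj[v])                 # neighbours of v in the set
    m_v = sum(1 for u in chosen if u != v and u not in adj[v])  # non-neighbours of v in the set
    return -1 + gamma * n_v - m_v


def satisfies_boundary_conditions(chosen, adj, gamma):
    """True if no coordinate can lower the objective by moving into [0, 1]."""
    for v in adj:
        d = partial_derivative(v, chosen, adj, gamma)
        if v in chosen and d > 0:      # at x_v = 1 the derivative must be <= 0
            return False
        if v not in chosen and d < 0:  # at x_v = 0 the derivative must be >= 0
            return False
    return True


def is_maximal_independent_set(chosen, adj):
    """Independent: no edge inside the set. Maximal: every outside node has a neighbour inside."""
    independent = all(u not in adj[v] for u in chosen for v in chosen)
    dominated = all(v in chosen or bool(adj[v] & chosen) for v in adj)
    return independent and dominated


# Hypothetical toy graph: the 5-cycle 0-1-2-3-4-0 plus the chord (0, 2).
edge_list = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
adj = {v: set() for v in range(5)}
for u, v in edge_list:
    adj[u].add(v)
    adj[v].add(u)

gamma = len(adj)  # assumed value: gamma = n

# Compare the two characterizations on every subset of nodes.
agree = all(
    satisfies_boundary_conditions(set(s), adj, gamma)
    == is_maximal_independent_set(set(s), adj)
    for size in range(len(adj) + 1)
    for s in combinations(adj, size)
)
print("boundary conditions match maximal independence on all subsets:", agree)
```

The choice of gamma matters in this sketch: with gamma = 0.5, the maximal independent set {1, 3} of the toy graph no longer satisfies the boundary conditions, because the derivative at node 0 becomes -1 + 0.5 - 1 = -1.5.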
Remark 5.
The above theorem implies that although there still exist non-binary stationary points, they are saddle points instead of local minimizers. Adaptive and momentum-based methods such as AdaGrad and Adam can usually break out of saddle points and land on local minimizers.
Appendix B Further Implementation Details
In Algorithm 1, we use as the set of initializations we would like to solve. To solve this set of initializations, we choose batch size and number of batches , such that . If we choose a large batch size , then we increase the time to first solution. Conversely, if we choose a large number of batches , then we decrease the number of initializations explored in a given batch, and potentially delay better solutions. Because of this relationship, and need to be chosen carefully. Our is 128 for SATLIB, 256 for ER, and 1024 for GNM. Our is 40 for SATLIB, 28 for ER, 10 for the GNM convergence results, and 5 for the GNM scalability results.
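The batching described above can be sketched as follows, assuming the total number of initializations equals the batch size times the number of batches and that initial vectors are drawn uniformly from [0, 1]. Both are assumptions for illustration; the helper name `make_batches` and the toy sizes are hypothetical.

```python
import random


def make_batches(initializations, batch_size):
    """Split a list of initial vectors into consecutive batches of `batch_size`."""
    return [initializations[i:i + batch_size]
            for i in range(0, len(initializations), batch_size)]


n = 6                            # number of nodes (toy value)
batch_size, num_batches = 4, 3   # hypothetical values, far smaller than those reported
rng = random.Random(0)
inits = [[rng.random() for _ in range(n)]
         for _ in range(batch_size * num_batches)]

batches = make_batches(inits, batch_size)
print(len(batches), len(batches[0]))
```

Each batch would be optimized in parallel on the GPU, so a larger batch delays the first reported solution, while more batches of smaller size explore fewer initializations at a time.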
The results of - and the baselines were obtained across three different machine configurations, with the runtime of the fastest configuration being reported. The NVIDIA A100 80GB PCIe machine utilizes an AMD EPYC 9554 CPU (30 vCores) with 236 GB of DDR5-4800. The NVIDIA RTX4090 24GB machine utilizes an AMD EPYC 75F3 CPU (16 vCores) with 64 GB of DDR4-2800. The NVIDIA RTX3070 8GB machine utilizes an Intel i9 12900K with 64 GB of DDR5-6000. Note: the i9 has Intel Hyper-Threading and E-Cores disabled to maximize single-core performance.
B.1 Efficient Implementation of MIS Checking
Based on the characteristics of the local minimizers of Problem (5), discussed in Lemma 6 and Theorem 5, we propose an efficient implementation to check whether a vector corresponds to an MIS. This check corresponds to Line 7 in Algorithm 1.
We note that we need to check two successive conditions. The first is whether a binary vector corresponds to an IS (no two nodes in the IS share an edge), and the second is whether this IS is maximal.
Given a vector , we obtain a binary representation of , denoted by , such that for all , if , and otherwise.
Based on the results of Appendix A, for some , our MIS checking involves verifying whether the following equality holds.
(18) |
In (18), a simple projected gradient descent step is used to check whether is at the boundary. We note that, computationally, this only requires a matrix-vector multiplication. Compared to the traditional method, which iterates over all the nodes in the MIS to check their neighbours, using (18) is 8× faster.
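A minimal sketch of this check is given below, next to the traditional neighbour scan. It assumes the gradient has the form -1 + gamma * (A x) - (A' x), where A and A' are the adjacency matrices of the graph and its complement, with gamma = n and step size alpha = 0.1. These forms and values, the toy graph, and the function names are all assumptions. The sums are written out in plain Python; in a tensor library they would reduce to a single matrix-vector product followed by a clamp to [0, 1].

```python
from itertools import product


def is_mis_projected_step(x_bin, adj, gamma, alpha=0.1):
    """Check x_bin == clip(x_bin - alpha * grad f(x_bin), 0, 1) coordinate-wise,
    under the ASSUMED gradient form stated in the lead-in."""
    total = sum(x_bin)
    for v, neighbours in enumerate(adj):
        a_x = sum(x_bin[u] for u in neighbours)  # (A x)_v
        ac_x = total - x_bin[v] - a_x            # (A' x)_v for the complement graph
        grad_v = -1.0 + gamma * a_x - ac_x
        stepped = min(1.0, max(0.0, x_bin[v] - alpha * grad_v))
        if stepped != x_bin[v]:                  # the step moved x_v off the boundary
            return False
    return True


def is_mis_by_neighbours(x_bin, adj):
    """Traditional check: scan each node's neighbours for independence and maximality."""
    for v, neighbours in enumerate(adj):
        covered = any(x_bin[u] for u in neighbours)
        if x_bin[v] and covered:          # two selected nodes share an edge
            return False
        if not x_bin[v] and not covered:  # v could still be added to the set
            return False
    return True


# Hypothetical toy graph: the 5-cycle 0-1-2-3-4-0 plus the chord (0, 2).
n = 5
edge_list = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
adj = [set() for _ in range(n)]
for u, v in edge_list:
    adj[u].add(v)
    adj[v].add(u)

gamma = n  # assumed value: gamma = n
same = all(
    is_mis_projected_step(list(bits), adj, gamma)
    == is_mis_by_neighbours(list(bits), adj)
    for bits in product([0, 1], repeat=n)
)
print("projected-step check agrees with neighbour scan:", same)
```

The projected step leaves a binary vector unchanged exactly when every coordinate at 1 has a non-positive derivative and every coordinate at 0 has a non-negative one, which is why a single gradient evaluation suffices for the test.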
Appendix C Requirements Comparison with Baselines
In Table 2, we provide an overview comparison of the number of trainable parameters, hyper-parameters, and additional techniques needed for each baseline.
ReduMIS depends on a large set of graph reductions (see Section 3.1 in [3]) and graph clustering, which is used for solution improvement.
For learning-based methods, the parameters of a neural network architecture are optimized during training. This architecture is typically much larger than the number of input coordinates (). For instance, the network used in DIFUSCO consists of 12 layers, each with 5 trainable weight matrices. Each weight matrix is , resulting in trainable parameters for the SATLIB dataset (which has at most 1347 nodes).
Moreover, this dependence on training a NN introduces several hyper-parameters such as the number of layers, size of layers, choice of activation functions, etc.
It’s important to note that the choice of the sampler in iSCO introduces additional hyper-parameters. For instance, the PAS sampler [35] used in iSCO depends on the choice of the neighborhood function, a prior on the path length, and the choice of the probability of acceptance.
In terms of the number of optimization variables, - only requires variables and a much-reduced number of hyper-parameters compared to iSCO.
Appendix D Additional Results
D.1 Convergence Plots with Fixed Run-time
In this section, we conduct an additional experiment to highlight the effectiveness of our proposed approach. Specifically, we use four ER (resp. GNM) graphs with (resp. ) and run our method and each baseline for a fixed run time of 14 (resp. 12) seconds. We then show the progress of the best obtained MIS over time. Figure 3 and Figure 4 present the results.
As observed, our method finds very good solutions early in the optimization process. For ER, within the 14-second time budget, we outperform the ILP commercial solvers in almost all cases. Additionally, ReduMIS takes 4 to 7 seconds to generate the first solution, whereas - produces a good solution within the first second. For GNM, most methods reach the 12-node mark. However, our method reaches the 12-node solution within the first two seconds.
These convergence plots provide additional evidence of the scalability of - .
Method | Size | Hyper-Parameters | Additional Techniques/Procedures |
---|---|---|---|
ReduMIS | variables | N/A | Many graph reductions, and graph clustering |
Gurobi | variables | N/A | N/A |
CP-SAT | variables | N/A | N/A |
GCN | trainable parameters | Many as it is learning-based | Tree Search |
LwD | trainable parameters | Many as it is learning-based | Entropy Regularization |
DIMES | trainable parameters | Many as it is learning-based | Tree Search or Sampling Decoding |
DIFUSCO | trainable parameters | Many as it is learning-based | Greedy Decoding or Sampling Decoding |
iSCO | variables | Temperature, Sampler, Chain length | Post Processing for Correction |
- | trainable parameters | Learning rate, exploration parameter , number of steps | Optional SDP initialization |
D.2 Impact of the Complement Graph Term in -
In this subsection, we demonstrate the impact of incorporating the proposed complement graph term. Specifically, we use three GNM graphs with and run Algorithm 1 for iterations, both with () and without (, similar to iSCO [34]) the complement graph term. Each time a solution is found, we sample from the uniform distribution and optimize using Adam until the iterations are complete. For both cases, we used an initial learning rate of 0.5. The results are presented in Figure 5.
As shown, when the third term is included, our algorithm finds larger MISs while requiring fewer iterations (the first three plots). The fourth plot illustrates the number of MISs found (which may not be unique) with and without the third term across the three graph instances (x-axis). It is evident that including the third term results in finding more than 100 solutions, whereas disabling the third term yields fewer than 5 solutions within the 10,000 iterations. This indicates that, given one initialization, utilizing the third term significantly accelerates the optimizer’s convergence to a local minimizer. Furthermore, fast convergence means that the number of initializations explored in the search space also increases, which improves exploration.