Dataless Quadratic Neural Networks for the Maximum Independent Set Problem

Ismail Alkhouri¹,², Cedric Le Denmat³, Yingjie Li⁴, Cunxi Yu⁴, Jia Liu³,
Rongrong Wang¹, Alvaro Velasquez⁵
¹Michigan State University
²University of Michigan Ann Arbor,
³Ohio State University,
⁴University of Maryland College Park,
⁵University of Colorado Boulder

Abstract

Combinatorial Optimization (CO) plays a crucial role in addressing various significant problems, among them the challenging Maximum Independent Set (MIS) problem. In light of recent advancements in deep learning methods, efforts have been directed towards leveraging data-driven learning approaches, typically rooted in supervised learning and reinforcement learning, to tackle the NP-hard MIS problem. However, these approaches rely on labeled datasets, exhibit weak generalization, and often depend on problem-specific heuristics. Recently, ReLU-based dataless neural networks were introduced to address combinatorial optimization problems. This paper introduces a novel dataless quadratic neural network formulation, featuring a continuous quadratic relaxation for the MIS problem. Notably, our method eliminates the need for training data by treating the given MIS instance as a trainable entity. More specifically, the graph structure and constraints of the MIS instance are used to define the structure and parameters of the neural network such that training it on a fixed input provides a solution to the problem, thereby setting it apart from traditional supervised or reinforcement learning approaches. By employing a gradient-based optimization algorithm such as Adam and leveraging an efficient off-the-shelf GPU parallel implementation, our straightforward yet effective approach demonstrates competitive or superior performance compared to state-of-the-art learning-based methods. Another significant advantage of our approach is that, unlike exact and heuristic solvers, the running time of our method scales only with the number of nodes in the graph, not the number of edges.

1 Introduction

In his landmark paper [1], Richard Karp introduced the concept of reducibility among combinatorial problems that are complete for the complexity class $\mathsf{NP}$ . This pivotal work established a connection between combinatorial optimization problems and the $\mathsf{NP}$ - $\mathsf{hard}$ complexity class, implying their inherent computational challenges. Although these problems are notorious for their intractability, they have proven to be foundational in various sectors [2], demonstrating their widespread applicability. While a polynomial-time solver remains elusive for solving $\mathsf{NP}$ - $\mathsf{hard}$ problems with respect to (w.r.t.) the input size, various efficient solvers have been developed [3]. Such solvers can be broadly classified into heuristic algorithms [4], branch-and-bound-based global optimization methods [5], and approximation algorithms [6].

In the $\mathsf{NP}$ - $\mathsf{hard}$ complexity class, one of the most fundamental problems is the ‘Maximum Independent Set’ (MIS) problem, which is concerned with determining a subset of vertices in a graph $G=(V,E)$ with maximum cardinality, such that no two vertices in this subset are connected by an edge [7]. In the past few decades, in addition to commercial Integer Programming (IP) solvers (e.g., CPLEX [8], Gurobi [9], and most recently CP-SAT [10]), powerful heuristic methods (e.g., ReduMIS in [3]) have been introduced to tackle the complexities inherent in the MIS problem. Notably, a plethora of data-driven machine learning approaches were proposed for solving the MIS problem [11, 12, 13]. These methods fall into two categories: Supervised Learning (SL) approaches and Reinforcement Learning (RL) approaches.

However, both data-driven SL and RL approaches are known for their unsatisfactory generalization performance when faced with graph instances exhibiting structural characteristics different from those in the training dataset [12] (see Section 2.2 for further discussion). Additionally, the number of trainable parameters in learning-based methods is significantly larger than in our approach. For instance, the network used in the most recent state-of-the-art (SOTA) method, DIFUSCO [14], consists of 12 layers, each with five trainable weight matrices of dimensions $256\times 256$ , resulting in nearly four million trainable parameters for the SATLIB dataset, which has graphs of at most $1347$ nodes. By comparison, our approach would require at most $n=1347$ trainable parameters. Moreover, many of these existing methods achieve SOTA results only when employing various MIS-specific subroutines, as thoroughly analyzed and elucidated in the recent work by [12]. The limitations of these data-dependent methods lead to an open question:

In this paper, we answer this question affirmatively by proposing a dataless quadratic neural network approach (dQNN). Collectively, our proposed dataless neural network offers a novel neural-network-based method for addressing combinatorial optimization problems without the need for any training data, hence resolving the out-of-distribution generalization challenges in existing learning-based methods. More concretely, given the MIS instance $G=(V,E)$ , we encode $G$ into the parameters of a neural network such that those parameters yield the solution to the given MIS problem after training. Such neural architecture designs can further benefit from GPU implementations with massive parallelism when compared to classical non-learning methods. The main contributions of our work are summarized as follows:

•

We first propose dataless quadratic networks ( $\mathsf{Quant}$ - $\mathsf{Net}$ ) that encode the input graph and its complement. The neural architecture of $\mathsf{Quant}$ - $\mathsf{Net}$ implicitly defines a continuous and differentiable relaxation of the MIS problem, thereby enabling an efficient optimization process and paving the way for enhanced performance for solving the MIS problem.
•

To improve the exploration of $\mathsf{Quant}$ - $\mathsf{Net}$ for solving the MIS problem, we propose three initialization schemes: (i) a sampling from the uniform distribution when the degrees of all nodes are similar, (ii) a sampling scheme based on a continuous semidefinite programming (SDP) relaxation of the MIS problem for sparse graphs, and (iii) a degree-based initialization scheme for dense graphs.
•

We provide a theoretical analysis on sufficient and necessary conditions of the edges-penalty parameter for $\mathsf{Quant}$ - $\mathsf{Net}$ . Furthermore, we provide theoretical insights on the local minimizers. These derivations shed light on the underlying dynamics of the optimization process associated with our proposed dataless neural network. The theoretical foundation strengthens the understanding of how these parameters influence the behavior of the optimization algorithm.
•

Our experiments on known challenging graph datasets, utilizing standard tools such as the Adam optimizer and GPU libraries, establish the efficacy of our proposed $\mathsf{Quant}$ - $\mathsf{Net}$ approach, which shows competitive or superior performance compared to SOTA data-driven learning approaches.

2 Preliminaries and related work

2.1 The MIS problem formulations

Notations:

Consider an undirected graph represented as $G=(V,E)$ , where $V$ is the vertex set and $E\subseteq V\times V$ is the edge set. The cardinality of a set is denoted by $|\cdot|$ . The number of nodes (resp. edges) is denoted by $|V|=n$ (resp. $|E|=m$ ). Unless otherwise stated, for a node $v\in V$ , we use $\mathcal{N}(v)=\{u\in V\mid(u,v)\in E\}$ to denote the set of its neighbors. The degree of a node $v\in V$ is denoted by $\textrm{d}(v)=|\mathcal{N}(v)|$ , and the maximum degree of the graph by $\Delta(G)$ . For a subset of nodes $U\subseteq V$ , we use $G[U]=(U,E[U])$ to represent the subgraph induced by the nodes in $U$ , where $E[U]=\{(u,v)\in E\mid u,v\in U\}$ . Given a graph $G$ , its complement is denoted by $G^{\prime}=(V,E^{\prime})$ , where $E^{\prime}=V\times V\setminus E$ is the set of all the edges between nodes that are not connected in $G$ . Consequently, if $|E^{\prime}|=m^{\prime}$ , then $m+m^{\prime}=n(n-1)/2$ represents the number of edges in the complete graph on $V$ . The graph adjacency matrix of graph $G$ is denoted by $\mathbf{A}_{G}\in\{0,1\}^{n\times n}$ . We use $\mathbf{I}$ to denote the identity matrix. The element-wise product of two matrices $\mathbf{A}$ and $\mathbf{B}$ is denoted by $\mathbf{A}\circ\mathbf{B}$ . The trace of a matrix $\mathbf{A}$ is denoted by $\mathrm{tr}(\mathbf{A})$ . We use $\mathrm{diag}(\mathbf{A})$ to denote the diagonal of $\mathbf{A}$ . For any positive integer $n$ , $[n]:=\{1,\ldots,n\}$ . The vector (resp. matrix) of all ones and size $n$ (resp. $n\times n$ ) is denoted by $\mathbf{e}_{n}$ (resp. $\mathbf{J}_{n}$ ). Furthermore, we use $\mathds{1}(\cdot)$ to denote the indicator function that returns $1$ (resp. $0$ ) when its argument is True (resp. False).
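As a small worked illustration of the complement notation (assuming a simple graph with no self-loops, so that only pairs of distinct nodes are counted), the complement adjacency matrix and edge count can be written directly in terms of the symbols defined above:

$$\mathbf{A}_{G^{\prime}}=\mathbf{J}_{n}-\mathbf{I}-\mathbf{A}_{G},\qquad m^{\prime}=\frac{n(n-1)}{2}-m.$$

For instance, a graph with $n=5$ nodes and $m=4$ edges has $m^{\prime}=10-4=6$ complement edges.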

Problem Statement:

In this paper, we consider the $\mathsf{NP}$ - $\mathsf{hard}$ problem of obtaining the maximum independent sets (MIS). Next, we formally define MIS and the complementary Maximum Clique (MC) problems.

Definition 1 (MIS Problem).

Given an undirected graph $G=(V,E)$ , the goal of MIS is to find a subset of vertices $\mathcal{I}\subseteq V$ such that $E[\mathcal{I}]=\emptyset$ , and $|\mathcal{I}|$ is maximized.

Definition 2 (MC Problem).

Given an undirected graph $G=(V,E)$ , the goal of MC is to find a subset of vertices $C\subseteq V$ such that $G[C]$ is a complete graph, and $|C|$ is maximized.

The two problems are complementary: an MIS of a graph is an MC of its complement graph [1]. Let each entry of the binary vector $\mathbf{z}\in\{0,1\}^{n}$ correspond to a node $v\in V$ , denoted by $\mathbf{z}_{v}\in\{0,1\}$ . An integer linear program (ILP) for MIS can be formulated as follows [16]:

$$\text{\textbf{ILP:}}\quad\max_{\mathbf{z}\in\{0,1\}^{n}}\sum_{v\in V}\mathbf{z}_{v}\quad\text{s.t.}\quad\mathbf{z}_{v}+\mathbf{z}_{u}\leq 1,\ \forall(v,u)\in E. \tag{1}$$

The following quadratic integer program (QIP) in (2) (with an optimal solution that is equivalent to the optimal solution of the above ILP) can also be used to formulate the MIS problem [17]:

$$\text{\textbf{QIP:}}\quad\max_{\mathbf{z}\in\{0,1\}^{n}}\mathbf{z}^{T}(\mathbf{I}-\mathbf{A}_{G})\mathbf{z}. \tag{2}$$
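The claimed equivalence between the ILP in (1) and the QIP in (2) can be checked by brute force on a tiny instance. The sketch below is only an illustration: the 5-node graph, the 0-based node labels, and the helper names are assumptions made here, and exhaustive enumeration is of course not how either program would be solved in practice.

```python
from itertools import product

# Hypothetical 5-node example graph (0-indexed), chosen only for illustration.
EDGES = [(0, 1), (0, 2), (1, 3), (1, 4)]
N = 5

def qip_value(z, edges):
    """z^T (I - A_G) z for binary z: number of selected nodes minus
    2 for every edge with both endpoints selected (A_G is symmetric)."""
    inside = sum(z[u] * z[v] for u, v in edges)
    return sum(z) - 2 * inside

def is_independent(z, edges):
    """ILP feasibility: z_u + z_v <= 1 on every edge."""
    return all(z[u] + z[v] <= 1 for u, v in edges)

def brute_force(n, edges):
    """Enumerate all binary vectors and return the QIP and ILP maximizers."""
    candidates = list(product((0, 1), repeat=n))
    best_qip = max(candidates, key=lambda z: qip_value(z, edges))
    best_ilp = max((z for z in candidates if is_independent(z, edges)), key=sum)
    return best_qip, best_ilp

best_qip, best_ilp = brute_force(N, EDGES)
```

On this instance the unconstrained QIP maximizer is itself an independent set, because every selected edge costs two units while adding a node gains only one.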

Furthermore, the work in [18] introduced the following semidefinite programming (SDP) relaxation of the MIS problem.

$$\text{\textbf{SDP:}}\quad\max_{\mathbf{X}\in\mathbb{R}^{n\times n}}\ \mathrm{tr}(\mathbf{J}_{n}\mathbf{X})\quad\text{s.t.}\quad\mathrm{tr}(\mathbf{X})=1,\ \mathbf{X}_{u,v}=0,\ \forall(u,v)\in E,\ \mathbf{X}\succeq 0, \tag{3}$$

where the third constraint enforces positive semi-definiteness of the optimization matrix $\mathbf{X}$ . We denote the optimal solution of (3) by $\mathbf{X}^{*}$ . The diagonal of $\mathbf{X}^{*}$ represents the ‘weight’ of each vertex in being part of the MIS, whereas the off-diagonal non-zero elements, i.e., for indices $(u,v)\notin E$ , indicate how likely nodes $u$ and $v$ are to be in the MIS. While Problem (3) is convex (which means every local optimum is also globally optimal) and can be solved in polynomial time, obtaining the MIS from $\mathbf{X}^{*}$ requires rounding techniques (such as spectral clustering [19]) that, in most cases, do not result in an optimal MIS in the graph.

2.2 Related work

1) Exact and Heuristic Solvers: Exact approaches for $\mathsf{NP}$ - $\mathsf{hard}$ problems typically rely on branch-and-bound global optimization techniques. However, exact approaches suffer from poor scalability, which limits their uses in large MIS problems [20]. This limitation has spurred the development of efficient approximation algorithms and heuristics. For instance, the well-known NetworkX library [21] implements a heuristic procedure for solving the MIS problem [6]. These polynomial-time heuristics often incorporate a mix of sub-procedures, including greedy algorithms, local search sub-routines, and genetic algorithms [22]. However, such heuristics generally cannot theoretically guarantee that the resulting solution is within a small factor of optimality. In fact, inapproximability results have been established for the MIS problem [23].

Among existing MIS heuristics, ReduMIS [3] has emerged as the leading approach. The ReduMIS framework contains two primary components: (i) an iterative application of various graph reduction techniques (e.g., the linear programming (LP) reduction method in [16]) with a stopping rule based on the non-applicability of these techniques; and (ii) an evolutionary algorithm. The ReduMIS algorithm starts with a pool of independent sets and evolves them through multiple rounds. In each round, a selection procedure identifies favorable nodes by executing graph partitioning, which clusters the graph nodes into disjoint clusters and separators to enhance the solution. In contrast, our $\mathsf{Quant}$ - $\mathsf{Net}$ approach does not require such complex algorithmic operations (e.g., solution combination operation, community detection, and local search algorithms for solution improvement) as used in ReduMIS. More importantly, ReduMIS and ILP solvers scale with both the number of nodes and the number of edges (which constrains their application on highly dense graphs), whereas $\mathsf{Quant}$ - $\mathsf{Net}$ only scales w.r.t. the number of nodes, as will be demonstrated in our experimental results.

2) Data-Driven Learning-Based Solvers: As mentioned in Section 1, data-driven approaches for MIS problems can be classified into SL and RL methods. A notable SL method is proposed in [24], which combines several components including graph reductions [3], Graph Convolutional Networks (GCN) [25], guided tree search, and a solution improvement local search algorithm [26]. The GCN is trained on benchmark graphs using their solutions as ground truth labels, enabling the learning of probability maps for the inclusion of each vertex in the optimal solution. Then, a subset of ReduMIS subroutines is used to improve their solution. While the work in [24] reported on-par results to ReduMIS, it was later shown by [12] that replacing the GCN output with random values performs similarly to using the trained GCN network. Recently, DIFUSCO was introduced in [14], an approach that integrates Graph Neural Networks (GNNs) with diffusion models [27] to create a graph-based diffusion denoiser. DIFUSCO formulates the MIS problem in the discrete domain and trains a diffusion model to improve a single or a pool of solutions.

On the other hand, RL-based methods have achieved more success in solving the MIS problem when compared to SL methods. In the work of [28], a Deep Q-Network (DQN) is combined with graph embeddings, facilitating the discrimination of vertices based on their influence on the solution and ensuring scalability to larger instances. Meanwhile, the study presented in [29] introduces the Learning What to Defer (LwD) method, an unsupervised deep RL solver resembling tree search, where vertices are iteratively assigned to the independent set. Their model is trained using Proximal Policy Optimization (PPO) [30]. The work in [31] introduces DIMES, which combines a compact continuous space to parameterize the distribution of potential solutions and a meta-learning framework to facilitate the effective initialization of model parameters during the fine-tuning stage that is required for each graph.

It is worth noting that the majority of SL and RL methods are data-dependent in the sense that they require the training of a separate network for each dataset of graphs. These data-dependent methods exhibit limited generalization performance when applied to out-of-distribution graph data. This weak generalization stems from the need to train a different network for each graph dataset. In contrast, our dQNN approach differs from SL- and RL-based methods in that it does not rely on any training datasets. Instead, our dQNN approach utilizes a simple yet effective graph-encoded continuous objective function, which is defined solely in terms of the connectivity of a given graph.

3) Dataless Differentiable Methods: The most related work to ours is [32], which introduces dataless neural networks (dNNs) tailored for the MIS problem. Notably, their method operates without the need for training data and relies on $n$ trainable parameters. Their proposed methodology advocates using a ReLU-based continuous objective to solve the MIS problem. However, to scale up and improve their method, graph partitioning and local search algorithms were employed.

4) Discrete Sampling Solvers: In recent studies, researchers have explored the integration of energy-based models with parallel implementations of simulated annealing to address combinatorial optimization problems [33] without relying on any training data. For example, in tackling the Maximum Independent Set (MIS) problem, Sun et al. [34] proposed a solver that combines (i) Path Auxiliary Sampling [35] and (ii) the binary quadratic integer program in (2). However, unlike $\mathsf{Quant}$ - $\mathsf{Net}$ , these approaches entail prolonged sequential runtime and require fine-tuning of multiple hyperparameters. Moreover, the energy models utilized in this method for addressing the MIS problem may generate binary vectors that violate the “no edges” constraint inherent to the MIS problem. Consequently, a post-processing procedure becomes necessary.

3 $\mathsf{Quant}$ - $\mathsf{Net}$ : Dataless quadratic neural networks for the MIS problem

In this section, we introduce our dataless neural network, $\mathsf{Quant}$ - $\mathsf{Net}$ , designed to solve the MIS problem through neural training without the need for any training data.

3.1 $\mathsf{Quant}$ - $\mathsf{Net}$ : The model.

We will first present the model of our proposed $\mathsf{Quant}$ - $\mathsf{Net}$ that is (i) differentiable everywhere w.r.t. $\mathbf{x}$ , and (ii) free of hyper-parameter scheduling. A continuous relaxation of QIP (2) is

$$\min_{\mathbf{x}\in[0,1]^{n}}-\sum_{v\in V}\mathbf{x}_{v}+\sum_{(u,v)\in E}\mathbf{x}_{v}\mathbf{x}_{u},\quad\text{or}\quad\min_{\mathbf{x}\in[0,1]^{n}}-\mathbf{e}_{n}^{T}\mathbf{x}+\frac{1}{2}\mathbf{x}^{T}\mathbf{A}_{G}\mathbf{x}. \tag{4}$$
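The two forms in Problem (4) agree because the adjacency matrix of an undirected graph is symmetric with a zero diagonal, so each edge is counted twice in the quadratic form. As a short supporting step (with $E$ read as a set of unordered pairs):

$$\mathbf{x}^{T}\mathbf{A}_{G}\mathbf{x}=\sum_{u\in V}\sum_{v\in V}(\mathbf{A}_{G})_{u,v}\,\mathbf{x}_{u}\mathbf{x}_{v}=2\sum_{(u,v)\in E}\mathbf{x}_{u}\mathbf{x}_{v}\quad\Longrightarrow\quad\frac{1}{2}\mathbf{x}^{T}\mathbf{A}_{G}\mathbf{x}=\sum_{(u,v)\in E}\mathbf{x}_{u}\mathbf{x}_{v}.$$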

Let $\mathbf{z}^{*}$ denote a binary minimizer of Problem (4). Then, it was shown in [36] that it corresponds to an MIS of size $\sum_{v\in V}\mathds{1}(\mathbf{z}^{*}_{v}=1)$ . Based on the quadratic MIS formulation in Problem (4), our proposed $\mathsf{Quant}$ - $\mathsf{Net}$ introduces several improvements and modifications to efficiently solve the MIS problem. In particular, $\mathsf{Quant}$ - $\mathsf{Net}$ incorporates an edges-penalty parameter $\gamma$ , which scales the influence of the edges of the graph $G$ on the optimization objective. Furthermore, $\mathsf{Quant}$ - $\mathsf{Net}$ uses the adjacency matrices of $G$ and $G^{\prime}$ . To see how $\mathsf{Quant}$ - $\mathsf{Net}$ is designed, we first consider the following $\gamma$ -parameterized augmented quadratic formulation for the MIS problem:

$$\min_{\mathbf{x}\in[0,1]^{n}}f(\mathbf{x}):=-\mathbf{e}_{n}^{T}\mathbf{x}+\frac{\gamma}{2}\mathbf{x}^{T}\mathbf{A}_{G}\mathbf{x}-\frac{1}{2}\mathbf{x}^{T}\mathbf{A}_{G^{\prime}}\mathbf{x}, \tag{5}$$

where $\gamma>1$ is the edges-penalty parameter. The rationale behind the third augmented term $-\frac{1}{2}\mathbf{x}^{T}\mathbf{A}_{G^{\prime}}\mathbf{x}$ in Problem (5) (corresponding to the edges of the complement graph $G^{\prime}$ ) is to encourage the optimizer to select two nodes with no edge connecting them in $G$ (implying an edge in $G^{\prime}$ ). We will theoretically show later in Theorem 4 that any MIS minimizer is a local minimizer of Problem (5) with an appropriately chosen $\gamma$ -value. Interestingly, the above continuous quadratic formulation of the MIS problem in Problem (5) admits a dataless implementation of the quadratic neural network (QNN), which was recently introduced in [37].
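To make the objective in Problem (5) concrete, the following minimal sketch evaluates $f(\mathbf{x})$ and its gradient, $-\mathbf{e}_{n}+\gamma\mathbf{A}_{G}\mathbf{x}-\mathbf{A}_{G^{\prime}}\mathbf{x}$ (which follows from the symmetry of both adjacency matrices), using plain Python lists. The example graph, the value $\gamma=5$, and all function names are assumptions made for illustration; the source implementation uses PyTorch rather than this code.

```python
# Hypothetical 5-node example graph (0-indexed), used only for illustration.
EDGES = [(0, 1), (0, 2), (1, 3), (1, 4)]

def adjacency(n, edges):
    """Dense symmetric 0/1 adjacency matrix A_G."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A

def complement(A):
    """Adjacency matrix of the complement graph: J - I - A_G."""
    n = len(A)
    return [[0 if i == j else 1 - A[i][j] for j in range(n)] for i in range(n)]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def objective(x, A, Ac, gamma):
    """f(x) = -e^T x + (gamma/2) x^T A_G x - (1/2) x^T A_G' x."""
    quad_g = sum(xi * a for xi, a in zip(x, matvec(A, x)))
    quad_c = sum(xi * a for xi, a in zip(x, matvec(Ac, x)))
    return -sum(x) + 0.5 * gamma * quad_g - 0.5 * quad_c

def gradient(x, A, Ac, gamma):
    """grad f(x) = -e + gamma * A_G x - A_G' x (both matrices are symmetric)."""
    return [-1.0 + gamma * a - c for a, c in zip(matvec(A, x), matvec(Ac, x))]

A = adjacency(5, EDGES)
Ac = complement(A)
f_at_independent_set = objective([1, 0, 0, 1, 1], A, Ac, 5.0)
```

For the independent set $\{0,3,4\}$ the edge term vanishes, so the objective reduces to minus the set size minus the number of complement edges inside the set.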

To see how Problem (5) corresponds to a $\mathsf{Quant}$ - $\mathsf{Net}$ , consider the graph example in Figure 1 (left), for which the $\mathsf{Quant}$ - $\mathsf{Net}$ is illustrated in Figure 1 (right). Here, the $\mathsf{Quant}$ - $\mathsf{Net}$ comprises two fully connected layers. The initial activation-free layer encodes information about the nodes (top $n=5$ connections), edges of $G$ (middle $m=4$ connections), and edges of $G^{\prime}$ (bottom $m^{\prime}=6$ connections), all without a bias vector. The subsequent fully connected layer is an activation-free layer performing a vector dot-product between the fixed weight vector (with $-1$ corresponding to the nodes and to the edges of $G^{\prime}$ , and the edges-penalty parameter $\gamma$ corresponding to the edges of $G$ ) and the output of the first layer.
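One way to read this two-layer description is sketched below. The edge list is an assumption: it is the 5-node, 4-edge graph inferred from the independent sets named in the caption of Figure 1 (with $v_1,\dots,v_5$ mapped to indices 0 to 4), not taken from the figure itself, and the function names are hypothetical. The first layer emits one value per node, per edge of $G$, and per edge of $G^{\prime}$; the second layer takes a dot product with the fixed weights $(-1,\gamma,-1)$.

```python
# Edges inferred from the Figure 1 caption (assumption), 0-indexed.
EDGES = [(0, 1), (0, 2), (1, 3), (1, 4)]
N = 5

def complement_edges(n, edges):
    """All unordered node pairs that are not edges of G."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if frozenset((u, v)) not in present]

def first_layer(x, edges, comp_edges):
    """n node values, then m edge products, then m' complement-edge products."""
    node_terms = list(x)
    edge_terms = [x[u] * x[v] for u, v in edges]
    comp_terms = [x[u] * x[v] for u, v in comp_edges]
    return node_terms + edge_terms + comp_terms

def second_layer(h, n, m, gamma):
    """Dot product with fixed weights: -1 (nodes), gamma (edges of G), -1 (edges of G')."""
    weights = [-1.0] * n + [float(gamma)] * m + [-1.0] * (len(h) - n - m)
    return sum(w * hi for w, hi in zip(weights, h))

COMP = complement_edges(N, EDGES)
hidden = first_layer([1.0, 0.0, 0.0, 1.0, 1.0], EDGES, COMP)
output = second_layer(hidden, N, len(EDGES), gamma=5.0)
```

The network output coincides with $f(\mathbf{x})$ in Problem (5), since $\frac{1}{2}\mathbf{x}^{T}\mathbf{A}\mathbf{x}$ equals the sum of $\mathbf{x}_{u}\mathbf{x}_{v}$ over the corresponding undirected edges.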

Utilizing the SDP solution $\mathbf{X}^{*}$ of Problem (3), along with its interpretation discussed in Section 2.1, $\mathsf{Quant}$ - $\mathsf{Net}$ has the flexibility of incorporating $\mathbf{X}^{*}$ into the objective in Problem (5) as follows:

$$\min_{\mathbf{x}\in[0,1]^{n}}-\mathbf{s}^{T}\mathbf{x}+\frac{\gamma}{2}\mathbf{x}^{T}\mathbf{A}_{G}\mathbf{x}-\frac{1}{2}\mathbf{x}^{T}(\mathbf{A}_{G^{\prime}}\circ\mathbf{N})\mathbf{x},\quad\mathbf{s}=\frac{\mathrm{diag}(\mathbf{X}^{*})}{\max_{v}\mathbf{X}^{*}_{v,v}},\quad\mathbf{N}=\frac{\mathbf{X}^{*}}{\max_{(u,v)\notin E}\mathbf{X}^{*}_{u,v}}. \tag{6}$$

$\mathbf{s}$ and $\mathbf{N}$ represent the likelihood of nodes in the graph $G$ and complement graph $G^{\prime}$ , respectively, to be included in the MIS. This serves as our rationale behind the objective function in Problem (6).

Remark 1.

Despite the polynomial-time complexity of the SDP formulation in (3) and the availability of efficient solvers such as MOSEK [38], efficiently obtaining the optimal solution in (3) is predominantly achievable for sparse graphs. This constraint emerges from the fact that the computational complexity of SDP grows proportionally with both the number of nodes and the number of edges in the graph. Consequently, the practical applicability of utilizing the objective function in Problem (6) is limited to sparse graphs.

Figure 1: $\mathsf{Quant}$ - $\mathsf{Net}$ (right) for graph $G$ (left). Sets $\textrm{MIS}_{1}=\{v_{1},v_{4},v_{5}\}$ and $\textrm{MIS}_{2}=\{v_{3},v_{4},v_{5}\}$ correspond to an MIS in $G$ and an MC in $G^{\prime}$ . Set $\textrm{MIS}_{3}=\{v_{2},v_{3}\}$ corresponds to a maximal independent set but not a maximum independent set in $G$ .

3.2 $\mathsf{Quant}$ - $\mathsf{Net}$ : The training algorithm

Drawing from the objective function of Problem (5) and the network structure of $\mathsf{Quant}$ - $\mathsf{Net}$ , we introduce an MIS training algorithm. Notably, in contrast to the ReLU-based dNN proposed in [32], the $\mathsf{Quant}$ - $\mathsf{Net}$ formulation in Problem (5) is characterized by being fully differentiable across $\mathbf{x}$ , enabling more numerically stable optimization [39], as demonstrated in our experimental findings discussed in Section 4.

Algorithm 1 The $\mathsf{Quant}$ - $\mathsf{Net}$ MIS Training Algorithm.

Input: Graph $G$ , number of iterations $T$ , edges-penalty parameter $\gamma$ , set of initializations $S$ , and learning rate $\alpha$ of Adam.
Output: The best obtained MIS $\mathcal{I}^{*}$ in $G$ .
1: Initialize $S_{Q}=\{\cdot\}$ (an empty set to collect MISs).
2: For $\mathbf{x}[0]\in S$ (This runs in parallel)
3:    For $t\in[T]$
4:      Run an Adam iteration (with $\alpha$ ) to get $\mathbf{x}[t]$ from $\mathbf{x}[t-1]$ on Problem (5) (or Problem (6)) with $\gamma$ .
5:      Obtain $\mathbf{x}[t]\leftarrow\mathrm{Proj}_{[0,1]^{n}}(\mathbf{x}[t])$ (box constraints).
6:      Obtain $\mathcal{I}(\mathbf{x}[t]):=\{v\in V:\mathbf{x}_{v}[t]>0\}$ .
7:      If $\mathcal{I}(\mathbf{x}[t])$ is an MIS in $G$ : Then $S_{Q}\leftarrow S_{Q}\cup\{\mathcal{I}(\mathbf{x}[t])\}$ . Break the inner for loop.
8: Return $\mathcal{I}^{*}=\operatorname*{argmax}_{\mathcal{I}\in S_{Q}}|\mathcal{I}|$

Our objective functions in Problems (5) and (6) are highly non-convex which makes finding the global minimizer(s) a challenging task. The work in [40] details the complexity of box-constrained continuous quadratic optimization problems. Gradient-based optimizers like Adam [15] are effective for finding a local minimizer given an initialization in $[0,1]^{n}$ . Due to the full differentiability of our objective (Problems (5) and (6)), Adam empirically proves to be computationally highly efficient. Consequently, for a single graph, we can initiate multiple optimizations from various points in $[0,1]^{n}$ and execute Adam in parallel for each. Specifically, with a specified number of batches (parallel processes) $B$ , we define set $S$ to denote all the initializations, where $|S|=B$ , and consider the following three approaches:

•

Random Initialization: Here, each vector in set $S$ is obtained by sampling each entry independently from the uniform distribution on $[0,1]$ . This strategy is effective when all vertices have similar degrees, as in Erdos-Renyi (ER) [41] graphs.
•

SDP-based Initialization: Given that the diagonal of $\mathbf{X}^{*}$ represents the ‘weight’ of each vertex to be in the MIS, we propose using the SDP solution by which set $S$ consists of vector $\mathbf{s}$ and $B-1$ samples drawn from a Gaussian distribution with mean vector $\mathbf{s}$ and covariance matrix $\eta\mathbf{I}$ . Here, $\eta$ serves as a hyper-parameter that governs the exploration around $\mathbf{s}$ . In this case, we use Problem (6). This strategy is effective particularly in scenarios where solving the SDP is computationally tractable, such as in the case of sparse graphs.
•

Degree-based Initialization: Following the intuition that vertices with higher degrees are less likely to belong to an MIS compared to those with lower degrees, we propose using set $S$ with $B$ samples drawn from a Gaussian distribution with mean vector $\mathbf{g}$ obtained as $\mathbf{g}_{v}=1-\frac{\mathrm{d}(v)}{\Delta(G)},\forall v\in V\>,\mathbf{g}% \leftarrow\frac{\mathbf{g}}{\max_{v}\mathbf{g}_{v}}\>,$ and covariance matrix $\eta\mathbf{I}$ . This will be our choice when computing the SDP solution is computationally expensive, i.e., for dense graphs.
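The degree-based scheme above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function names, the seed handling, and the treatment of two edge cases (an edgeless graph, and a regular graph where every $\mathbf{g}_{v}$ is zero) are assumptions introduced here. A covariance of $\eta\mathbf{I}$ corresponds to an independent per-entry standard deviation of $\sqrt{\eta}$.

```python
import random

def degree_based_mean(n, edges):
    """g_v = 1 - d(v)/Delta(G), then rescaled so that max_v g_v = 1."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    max_deg = max(deg)
    if max_deg == 0:
        # Assumption: with no edges, every node is equally (and fully) favored.
        return [1.0] * n
    g = [1.0 - d / max_deg for d in deg]
    top = max(g)
    if top == 0.0:
        # Assumption: for regular graphs g is all zeros; skip the rescaling.
        return g
    return [gv / top for gv in g]

def sample_initializations(mean, eta, batch, seed=0):
    """Draw `batch` vectors from N(mean, eta * I), entry by entry."""
    rng = random.Random(seed)
    std = eta ** 0.5
    return [[rng.gauss(m, std) for m in mean] for _ in range(batch)]

# Hypothetical 5-node example graph (0-indexed).
EDGES = [(0, 1), (0, 2), (1, 3), (1, 4)]
mean = degree_based_mean(5, EDGES)
S = sample_initializations(mean, eta=2.25, batch=4)
```

The samples are not clipped here; whether to clip them to $[0,1]^{n}$ before the first optimizer step is a design choice the text does not specify.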

We outline the MIS training procedure for $\mathsf{Quant}$ - $\mathsf{Net}$ in Algorithm 1. As shown, the algorithm takes a graph $G$ , a set of initializations $S$ , the maximum number of iterations per batch $T$ (with iteration index $t$ ), an edge-penalty parameter $\gamma$ , and Adam learning rate $\alpha$ as inputs. For each batch and iteration $t$ , the Adam optimizer updates $\mathbf{x}$ (Line 4). In Line 5, a projection onto $[0,1]^{n}$ is employed. In Lines 6 and 7, the algorithm checks whether the updated $\mathbf{x}$ corresponds to an MIS in the graph. If yes, the algorithm stops for this batch. Finally, the best-found MIS, determined by its cardinality, is returned in Line 8. The blue font is used to indicate the case when we use the SDP initialization.
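The procedure just outlined can be sketched in pure Python as follows. Several points are assumptions made for this sketch and are not confirmed by the text: Adam is written out by hand with its commonly used default moment parameters, the initializations are processed sequentially instead of in parallel, the check in Line 7 is interpreted as "independent and maximal", and $\gamma$ defaults to $n$. The gradient uses neighbor lists, so the complement graph is never materialized.

```python
import random

def build_neighbors(n, edges):
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

def gradient(x, nbrs, gamma):
    """Gradient of Problem (5): -1 + gamma * (sum over neighbors) - (sum over non-neighbors)."""
    total = sum(x)
    grad = []
    for v, xv in enumerate(x):
        on_edges = sum(x[u] for u in nbrs[v])
        on_non_edges = total - xv - on_edges
        grad.append(-1.0 + gamma * on_edges - on_non_edges)
    return grad

def is_maximal_independent(selected, nbrs):
    """Selected nodes need zero selected neighbors; unselected nodes need at least one."""
    for v in range(len(nbrs)):
        hits = sum(1 for u in nbrs[v] if u in selected)
        if (v in selected) == (hits > 0):
            return False
    return True

def quant_net_mis(n, edges, inits, T=200, gamma=None, lr=0.5,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    gamma = float(n) if gamma is None else gamma
    nbrs = build_neighbors(n, edges)
    found = []                               # S_Q in Algorithm 1
    for x0 in inits:                         # Line 2 (sequential in this sketch)
        x = list(x0)
        m1, m2 = [0.0] * n, [0.0] * n        # Adam moment estimates
        for t in range(1, T + 1):            # Line 3
            g = gradient(x, nbrs, gamma)
            for i in range(n):               # Line 4: one Adam step
                m1[i] = beta1 * m1[i] + (1 - beta1) * g[i]
                m2[i] = beta2 * m2[i] + (1 - beta2) * g[i] * g[i]
                step = (m1[i] / (1 - beta1 ** t)) / ((m2[i] / (1 - beta2 ** t)) ** 0.5 + eps)
                x[i] = min(1.0, max(0.0, x[i] - lr * step))   # Line 5: projection
            selected = {v for v in range(n) if x[v] > 0}      # Line 6
            if is_maximal_independent(selected, nbrs):        # Line 7
                found.append(selected)
                break
    return max(found, key=len) if found else None             # Line 8

# Hypothetical 5-node example graph with uniform random initializations.
EDGES = [(0, 1), (0, 2), (1, 3), (1, 4)]
rng = random.Random(0)
inits = [[rng.random() for _ in range(5)] for _ in range(8)]
best = quant_net_mis(5, EDGES, inits)
```

The function returns `None` when no initialization reaches a valid set within `T` iterations, so callers should handle that case explicitly.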

3.3 $\mathsf{Quant}$ - $\mathsf{Net}$ : Theoretical foundation

Here, we provide the necessary and sufficient condition on the $\gamma$ -value for any MIS to correspond to local minimizers of Problem (5). Moreover, we also provide a sufficient condition for all local minimizers of Problem (5) to be associated with a MIS. We relegate the proofs to Appendix A.

Definition 3 (MIS vector).

Given a graph $G=(V,E)$ , a binary vector $\mathbf{x}\in\{0,1\}^{n}$ is called a MIS vector if there exists a MIS $S$ of $G$ such that $\mathbf{x}_{i}=1$ for all $i\in S$ , and $\mathbf{x}_{i}=0$ for all $i\notin S$ .

Theorem 4 (Necessary and Sufficient Condition on $\gamma$ for MIS vectors to be local minimizers of Problem (5)).

Given an arbitrary graph $G=(V,E)$ and its corresponding $\mathsf{Quant}$ - $\mathsf{Net}$ formulation in Problem (5), suppose the size of the largest MIS of $G$ is $k$ . Then, $\gamma\geq k+1$ is necessary and sufficient for all MIS vectors to be local minimizers of Problem (5) for arbitrary graphs.
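As a numerical sanity check (not a proof, and not a substitute for the argument in Appendix A), the sketch below tests the first-order conditions that any local minimizer of a box-constrained problem must satisfy: the partial derivative is non-negative at coordinates fixed at $0$ and non-positive at coordinates fixed at $1$. The graph is the one inferred from the Figure 1 caption (an assumption), where $k=3$, and the function names are hypothetical.

```python
# Edges inferred from the Figure 1 caption (assumption), 0-indexed.
EDGES = [(0, 1), (0, 2), (1, 3), (1, 4)]
N = 5

def gradient_at(x, n, edges, gamma):
    """Gradient of Problem (5) at x, computed from neighbor sets."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    total = sum(x)
    return [-1.0 + gamma * sum(x[u] for u in nbrs[v])
            - (total - x[v] - sum(x[u] for u in nbrs[v]))
            for v in range(n)]

def satisfies_first_order(x, n, edges, gamma):
    """Necessary conditions at a binary point of [0,1]^n:
    derivative >= 0 where x_v = 0, derivative <= 0 where x_v = 1."""
    g = gradient_at(x, n, edges, gamma)
    return all((xv == 0 and gv >= 0) or (xv == 1 and gv <= 0)
               for xv, gv in zip(x, g))

# Indicator vectors of the three sets named in the Figure 1 caption.
VECTORS = {
    "MIS_1": [1, 0, 0, 1, 1],
    "MIS_2": [0, 0, 1, 1, 1],
    "MIS_3": [0, 1, 1, 0, 0],
}
at_threshold = {name: satisfies_first_order(x, N, EDGES, gamma=4.0)
                for name, x in VECTORS.items()}
too_small = {name: satisfies_first_order(x, N, EDGES, gamma=1.0)
             for name, x in VECTORS.items()}
```

With $\gamma=k+1=4$ all three vectors pass this check, whereas with $\gamma=1$ the vector for $\textrm{MIS}_{1}$ fails at an unselected node, which matches the qualitative role of the edges-penalty parameter.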

Remark 2.

Theorem 4 provides a guideline for choosing $\gamma$ in $\mathsf{Quant}$ - $\mathsf{Net}$ . Note that the MIS set size $k$ is usually unknown a priori. But we may employ any classical estimate of the MIS size $k$ to guide the choice of $\gamma$ (e.g., we know from [42] that $k\geq\sum_{v\in V}\frac{1}{1+\textrm{d}(v)}$ ).
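The degree-based estimate mentioned in Remark 2 is straightforward to compute. The sketch below is illustrative only: the function name and the rounding-up step are assumptions, and because the estimate is a lower bound on $k$, the resulting $\gamma$ is a starting guess that may fall below $k+1$ on other graphs.

```python
import math
from fractions import Fraction

def degree_lower_bound(n, edges):
    """Sum over nodes of 1 / (1 + d(v)), kept exact with Fractions."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(Fraction(1, 1 + d) for d in deg)

# Hypothetical 5-node example graph (0-indexed).
EDGES = [(0, 1), (0, 2), (1, 3), (1, 4)]
bound = degree_lower_bound(5, EDGES)
k_estimate = math.ceil(bound)      # k is an integer, so the bound can be rounded up
gamma_guess = k_estimate + 1       # heuristic starting value suggested by Theorem 4
```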

Next, we provide a stronger condition on $\gamma$ that ensures all local minimizers of Problem (5) correspond to a MIS.

Theorem 5 ( $\mathsf{Quant}$ - $\mathsf{Net}$ Local Minimizers).

Given graph $G=(V,E)$ and set $\gamma\geq n$ , all local minimizers of (5) are MIS vectors of $G$ .

Remark 3.

The assumption $\gamma\geq n$ in Theorem 5 is stronger than that in Theorem 4 (as for any graph $G$ with $E\neq\emptyset$ (non-empty graph), we have $n>k$ ). The trade-off of choosing a large $\gamma$ -value is that while larger values of $\gamma$ ( $\gamma\geq n$ ) ensure that only MIS are local minimizers, they also increase the non-convexity of the optimization problem, thereby making it more difficult to solve.

Remark 4.

Although the proposed constrained quadratic Problem (5) is still NP-hard to solve for the global minimizer, it is a relaxation of the original integer programming problem. It can leverage gradient information, allowing the use of high-performance computational resources and parallel processing to enhance the efficiency and scalability of our approach.

4 Experimental results

1) Settings, Baselines, & Benchmarks: Graphs are processed using the NetworkX library [21]. For baselines, we utilize Gurobi [9] and the recent Google solver CP-SAT [10] for the ILP in (1), ReduMIS [3], iSCO [34] (https://github.com/google-research/discs), and four learning-based methods: DIMES [31], DIFUSCO [14], LwD [29], and the GCN method in [24] (commonly referred to as ‘Intel’). We note that, following the analysis in [12], GCN’s code cloning to ReduMIS is disabled, which was also done in [14]. Aligned with recent SOTA methods (DIMES, DIFUSCO, and iSCO), we employ the Erdos-Renyi (ER) [41] graphs from [31] and the SATLIB graphs [43] as benchmarks. The ER dataset (https://github.com/DIMESTeam/DIMES) consists of $128$ graphs with $700$ to $800$ nodes and $p=0.15$ , where $p$ is the probability of edge creation. The SATLIB dataset consists of $500$ graphs (with at most $1,347$ nodes and $5,978$ edges). Additionally, the GNM random graph generator function of NetworkX is utilized for our scalability experiment. For $\mathsf{Quant}$ - $\mathsf{Net}$ , the edges-penalty parameter $\gamma$ is selected as $775$ . The initial learning rate is $0.6$ for ER, $0.9$ for SATLIB, and $0.5$ for GNM. The number of iterations per initialization, $T$ , is set to $150$ for ER, $50$ for SATLIB, and $350$ for GNM. The exploration parameter for SDP-based and Degree-based initialization is set to $\eta=2.25$ . Our code (https://anonymous.4open.science/r/Quant-Net/README.md) uses PyTorch [44] to construct the objective function in $\mathsf{Quant}$ - $\mathsf{Net}$ , and PyTorch’s implementation of Adam to optimize. Further implementation details and results are provided in Appendix B and Appendix D, respectively.

2) ER and SATLIB Benchmark Results:

Here, we present the results of $\mathsf{Quant}$ - $\mathsf{Net}$ , along with the other considered baselines using the SATLIB (Table 1(a)) and ER (Table 1(b)) benchmarks in terms of average MIS size over the graphs in the dataset and the total sequential run-time required to obtain the results for all the graphs. We note that the results of the learning-based methods are sourced from [14]. In what follows, we provide observations on these results.

•

All learning-based methods, except for GCN, require training a separate network for each graph dataset, as indicated in the third column of Table 1(a) and Table 1(b). This illustrates the generalization limitations of these methods. In contrast, our method is more generalizable, as it only requires tuning a few hyper-parameters for each set of graphs.
•

When compared to learning-based approaches, on ER (resp. SATLIB), our method outperforms all (resp. most) baseline methods, all without requiring any training data. Run-time comparison with these methods is not considered, as the reported numbers exclude training time, which may vary depending on multiple factors such as graph size, available compute, number of data points, and the neural network architecture used. Furthermore, our approach does not rely on additional techniques such as Greedy Decoding [45] and Monte Carlo Tree Search [46].
•

When compared to iSCO, our method reports a comparable average MIS size for SATLIB, while falling short by between one and two nodes on ER (43.52 versus 44.8). Nevertheless, our method requires a significantly shorter sequential run time. It is important to note that the iSCO paper [34] reports a lower run time as compared to other methods. This reported run time is achieved by evaluating the test graphs in parallel, in contrast to all other methods, which evaluated them sequentially. To fairly compare methods in our experiments, we opted to report sequential test run time only. The extended sequential run-time of iSCO, compared to its parallel run-time, is due to its use of simulated annealing. Because simulated annealing depends on knowing the energy of the previous step when determining the next step, it is inherently more efficient for iSCO to solve many graphs in parallel than in series.
•

For SATLIB, which consists of highly sparse graphs, on average, $\mathsf{Quant}$ - $\mathsf{Net}$ falls short by a few nodes when compared to ReduMIS, Gurobi, and CP-SAT. The reason ReduMIS achieves SOTA results here is that a large set of MIS-specific graph reductions can be applied. However, for denser graphs, most of these graph reductions are not applicable. Gurobi (and CP-SAT) solves the ILP in (1), in which the number of constraints is equal to the number of edges in the graph. This means that Gurobi and CP-SAT are expected to perform much better on sparse graphs such as those in SATLIB.
•

On ER, when compared to Gurobi and CP-SAT, our method not only reports a larger average MIS size but also requires less than half the run-time. This is because the ER graphs are denser than those in SATLIB. As a result, when run for 64 minutes on ER, Gurobi and CP-SAT fall short compared to our method and ReduMIS, while reporting the same average MIS size as ReduMIS for SATLIB.

Table 1: Benchmark dataset results in terms of average MIS size and total sequential run-time (minutes). RL, SL, G, S, and TS represent Reinforcement Learning, Supervised Learning, Greedy decoding, Sampling, and Tree Search, respectively. We note that the run time reported in iSCO (Table 1 in [34]) is for running multiple graphs in parallel, not a sequential total run time. Therefore, we ran a few graphs sequentially and obtained the extrapolated run-time in column 5. SDP, Degree-based Initialization (DI), and random initialization (RI) represent the initializations used with $\mathsf{Quant}$ - $\mathsf{Net}$ . ReduMIS employs the local search procedure from [26], which no other method in the table uses, following the study in [12]. For more details about the requirements of each method, see Appendix C.

(a) SATLIB Dataset

Method	Type	Training Data	MIS Size	Total Run-time (m)	Run-time Comment
ReduMIS	Heuristics	Not Required	425.96	37.58	Run until completion
CP-SAT	Exact	Not Required	425.96	0.78	Run until completion
Gurobi	Exact	Not Required	425.96	8.16	Run until completion
GCN	SL+G	SATLIB	420.66	23.05	This excludes training time
LwD	RL+S	SATLIB	422.22	18.83	This excludes training time
DIMES	RL+TS	SATLIB	422.22	18.83	This excludes training time
DIMES	RL+S	SATLIB	423.28	20.26	This excludes training time
DIFUSCO	SL+G	SATLIB	424.5	8.76	This excludes training time
DIFUSCO	SL+S	SATLIB	425.13	23.74	This excludes training time
iSCO	Sampling	Not Required	423.7	$\sim$ 7500	Sequential runtime; original paper shows parallelized runtime
$\mathsf{Quant}$ - $\mathsf{Net}$ + SDP (Ours)	dQNN	Not Required	423.22	89.69	This excludes SDP time (30 seconds per graph)
$\mathsf{Quant}$ - $\mathsf{Net}$ + DI (Ours)	dQNN	Not Required	423.03	64.9	Run until completion

(b) ER Dataset

Method	Type	Training Data	MIS Size	Total Run-time (m)	Run-time Comment
ReduMIS	Heuristics	Not Required	44.87	52.13	Run until completion
CP-SAT	Exact	Not Required	41.09	64.00	Run with 30 second time limit per graph
Gurobi	Exact	Not Required	39.19	64.00	Run with 30 second time limit per graph
GCN	SL+G	SATLIB	34.86	6.06	This excludes training time
GCN	SL+TS	SATLIB	38.8	20	This excludes training time
LwD	RL+S	ER	41.17	6.33	This excludes training time
DIMES	RL+TS	ER	38.24	6.12	This excludes training time
DIMES	RL+S	ER	42.06	12.01	This excludes training time
DIFUSCO	SL+G	ER	38.83	8.8	This excludes training time
DIFUSCO	SL+S	ER	41.12	26.27	This excludes training time
iSCO	Sampling	Not Required	44.8	$\sim$ 384	Sequential runtime; original paper shows parallelized runtime
$\mathsf{Quant}$ - $\mathsf{Net}$ + RI (Ours)	dQNN	Not Required	43.52	21	Run until completion

3) Scalability Results:

It is well-established that relatively denser graphs pose greater computational challenges compared to sparse graphs. This is consistent with the trends exhibited by the other baselines, which predominantly excel on sparse graphs. We argue that this is due to the applicability of graph reduction techniques such as the LP reduction method in [16], and the unconfined vertices rule [47] (see [3] for a complete list of the graph reduction rules that apply only to sparse graphs). For instance, by simply applying the LP graph reduction technique, the large-scale highly sparse graphs (with several hundred thousand nodes), considered in Table 5 of [24], reduce to graphs of a few thousand nodes with often disconnected sub-graphs that can be treated independently.

Therefore, the scalability and performance of ReduMIS depend significantly on the sparsity of the graph. This dependence emerges from the iterative application of various graph reduction techniques in ReduMIS, specifically tailored for sparse graphs. For instance, the ReduMIS results presented in Table 2 of [29] are exclusively based on large and highly sparse graphs. This conclusion is substantiated by both the sizes of the considered graphs and the corresponding sizes of the obtained MIS solutions. As such, in this experiment, we investigate the scalability of $\mathsf{Quant}$ - $\mathsf{Net}$ for the MIS problem against the SOTA data-independent methods: ReduMIS, Gurobi, and CP-SAT. Here, we use randomly generated graphs with the GNM generator, in which the number of edges is set to $m=\lceil\frac{n(n-1)}{4}\rceil$ . It is important to note that the density of these graphs is significantly higher than those considered in previous works. This choice of the number of edges in the GNM function indicates that half of the total possible edges (w.r.t. the complete graph) exist.
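As a quick sanity check of the edge-count rule above, the sketch below evaluates $m=\lceil\frac{n(n-1)}{4}\rceil$ for the graph sizes used in the scalability experiment and reports the resulting density. The function names are illustrative; the value of $m$ would then be passed, together with $n$ , to a GNM generator such as NetworkX's `gnm_random_graph`.

```python
def gnm_edge_count(n):
    """m = ceil(n(n-1)/4), using integer arithmetic to avoid float rounding."""
    return -(-(n * (n - 1)) // 4)


def density(n, m):
    """Fraction of the n(n-1)/2 possible edges that are present."""
    return m / (n * (n - 1) / 2)


for n in (50, 500, 1000, 1500, 2000):
    m = gnm_edge_count(n)
    print(n, m, round(density(n, m), 4))
```

The computed pairs match the $(n,m)$ values listed with the scalability results, e.g. $(50,613)$ and $(2000,999500)$ , and the density is $0.5$ up to the rounding introduced by the ceiling.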

Results are provided in Figure 2. As observed, for dense graphs, as the graph size increases, our method requires significantly less run-time (Figure 2(a)) compared to all baselines, while reporting almost the same average MIS size (Figure 2(b)). For instance, when $n$ is $500$ , on average, our method requires around 1 minute to solve the 5 graphs, whereas other baselines require approximately 45 minutes or more to achieve the same MIS size. These results indicate that, unlike ReduMIS and ILP solvers, the run-time of our method scales only with the number of nodes in the graph, which is a significant improvement.

$(n,m)$	ReduMIS	Gurobi	CP-SAT	$\mathsf{Quant}$ - $\mathsf{Net}$ (Ours)
$(50,613)$	$7.6$	$7.6$	$7.6$	$7.6$
$(500,62375)$	$13.4$	$13.4$	$13.4$	$13.4$
$(1000,249750)$	$15.0$	N/A	N/A	$14.8$
$(1500,562125)$	$16.0$	N/A	N/A	$15.2$
$(2000,999500)$	$16.4$	N/A	N/A	$16$

5 Conclusion

This study addressed the challenging Maximum Independent Set (MIS) Problem within the domain of Combinatorial Optimization by introducing an innovative continuous formulation employing dataless quadratic neural networks. By eliminating the need for any training data, $\mathsf{Quant}$ - $\mathsf{Net}$ sets itself apart from conventional learning approaches. Through the utilization of gradient-based optimization using ADAM and a GPU implementation, our straightforward yet effective approach demonstrates competitive performance compared to state-of-the-art learning-based and sampling-based methods. This research offers a distinctive perspective on approaching discrete optimization problems through parameter-efficient neural networks that are trained from the problem structure, not from datasets.

References

Karp [1972] Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.
Bengio et al. [2021] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of Operational Research, 290(2):405–421, 2021.
Lamm et al. [2016] Sebastian Lamm, Peter Sanders, Christian Schulz, Darren Strash, and Renato F Werneck. Finding near-optimal independent sets at scale. In 2016 Proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 138–150. SIAM, 2016.
Akiba and Iwata [2016] Takuya Akiba and Yoichi Iwata. Branch-and-reduce exponential/fpt algorithms in practice: A case study of vertex cover. Theoretical Computer Science, 609:211–225, 2016.
San Segundo et al. [2011] Pablo San Segundo, Diego Rodríguez-Losada, and Agustín Jiménez. An exact bit-parallel algorithm for the maximum clique problem. Computers & Operations Research, 38(2):571–581, 2011.
Boppana and Halldórsson [1992] Ravi Boppana and Magnús M Halldórsson. Approximating maximum independent sets by excluding subgraphs. BIT Numerical Mathematics, 32(2):180–196, 1992.
Tarjan and Trojanowski [1977] Robert Endre Tarjan and Anthony E Trojanowski. Finding a maximum independent set. SIAM Journal on Computing, 6(3):537–546, 1977.
[8] IBM. IBM ILOG CPLEX Optimization Studio. URL https://www.ibm.com/products/ilog-cplex-optimization-studio.
[9] Gurobi. Gurobi Optimization. URL https://www.gurobi.com.
Google, Inc. [2022] Google, Inc. Google or-tools. 2022. URL https://developers.google.com/optimization.
He et al. [2014] He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. Advances in neural information processing systems, 27:3293–3301, 2014.
Böther et al. [2022] Maximilian Böther, Otto Kißig, Martin Taraz, Sarel Cohen, Karen Seidel, and Tobias Friedrich. What’s wrong with deep learning in tree search for combinatorial optimization. arXiv preprint arXiv:2201.10494, 2022.
Dong et al. [2021] Yuanyuan Dong, Andrew V Goldberg, Alexander Noe, Nikos Parotsidis, Mauricio GC Resende, and Quico Spaen. New instances for maximum weight independent set from a vehicle routing application. In Operations Research Forum, volume 2, pages 1–6. Springer, 2021.
Sun and Yang [2023] Zhiqing Sun and Yiming Yang. Difusco: Graph-based diffusion solvers for combinatorial optimization. arXiv preprint arXiv:2302.08224, 2023.
Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
Nemhauser and Trotter [1975] George L Nemhauser and Leslie Earl Trotter. Vertex packings: Structural properties and algorithms. Mathematical Programming, 8(1):232–248, 1975.
Pardalos and Rodgers [1992] Panos M Pardalos and Gregory P Rodgers. A branch and bound algorithm for the maximum clique problem. Computers & operations research, 19(5):363–375, 1992.
Lovász [1979] László Lovász. On the shannon capacity of a graph. IEEE Transactions on Information theory, 25(1):1–7, 1979.
Von Luxburg [2007] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
Dai et al. [2016] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International conference on machine learning, pages 2702–2711. PMLR, 2016.
Hagberg et al. [2008] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using networkx. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11 – 15, Pasadena, CA USA, 2008.
Williamson and Shmoys [2011] David P Williamson and David B Shmoys. The design of approximation algorithms. Cambridge university press, 2011.
Berman and Schnitger [1992] Piotr Berman and Georg Schnitger. On the complexity of approximating the independent set problem. Information and Computation, 96(1):77–94, 1992.
Li et al. [2018] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In NeurIPS, 2018.
Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29:3844–3852, 2016.
Andrade et al. [2012] Diogo V Andrade, Mauricio GC Resende, and Renato F Werneck. Fast local search for the maximum independent set problem. Journal of Heuristics, 18(4):525–547, 2012.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Dai et al. [2017] Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6351–6361, 2017.
Ahn et al. [2020] Sungsoo Ahn, Younggyo Seo, and Jinwoo Shin. Learning what to defer for maximum independent sets. In International Conference on Machine Learning, pages 134–144. PMLR, 2020.
Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Qiu et al. [2022] Ruizhong Qiu, Zhiqing Sun, and Yiming Yang. Dimes: A differentiable meta solver for combinatorial optimization problems. Advances in Neural Information Processing Systems, 35:25531–25546, 2022.
Alkhouri et al. [2022] Ismail R Alkhouri, George K Atia, and Alvaro Velasquez. A differentiable approach to the maximum independent set problem using dataless neural networks. Neural Networks, 155:168–176, 2022.
Goshvadi et al. [2024] Katayoon Goshvadi, Haoran Sun, Xingchao Liu, Azade Nova, Ruqi Zhang, Will Grathwohl, Dale Schuurmans, and Hanjun Dai. Discs: A benchmark for discrete sampling. Advances in Neural Information Processing Systems, 36, 2024.
Sun et al. [2023] Haoran Sun, Katayoon Goshvadi, Azade Nova, Dale Schuurmans, and Hanjun Dai. Revisiting sampling for combinatorial optimization. In International Conference on Machine Learning, pages 32859–32874. PMLR, 2023.
Sun et al. [2021] Haoran Sun, Hanjun Dai, Wei Xia, and Arun Ramamurthy. Path auxiliary proposal for mcmc in discrete space. In International Conference on Learning Representations, 2021.
Mahdavi Pajouh et al. [2013] Foad Mahdavi Pajouh, Balabhaskar Balasundaram, and Oleg A Prokopyev. On characterization of maximal independent sets via quadratic optimization. Journal of Heuristics, 19:629–644, 2013.
Fan et al. [2020] Fenglei Fan, Jinjun Xiong, and Ge Wang. Universal approximation with quadratic deep networks. Neural Networks, 124:383–392, 2020.
[38] MOSEK ApS. MOSEK: Optimization software. https://www.mosek.com.
Liu et al. [2021] Bo Liu, Zhaoying Liu, Ting Zhang, and Tongtong Yuan. Non-differentiable saddle points and sub-optimal local minima exist for deep relu networks. Neural Networks, 144:75–89, 2021.
Burer and Letchford [2009] Samuel Burer and Adam N Letchford. On nonconvex quadratic programming with box constraints. SIAM Journal on Optimization, 20(2):1073–1089, 2009.
Erdos et al. [1960] Paul Erdos, Alfréd Rényi, et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
Wei [1981] Victor K Wei. A lower bound on the stability number of a simple graph. Technical report, Bell Laboratories Technical Memorandum Murray Hill, NJ, USA, 1981.
Hoos and Stützle [2000] Holger H Hoos and Thomas Stützle. Satlib: An online resource for research on sat. Sat, 2000:283–292, 2000.
[44] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch. https://pytorch.org/.
Graikos et al. [2022] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. Advances in Neural Information Processing Systems, 35:14715–14728, 2022.
Fu et al. [2021] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large tsp instances. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 7474–7482, 2021.
Xiao and Nagamochi [2013] Mingyu Xiao and Hiroshi Nagamochi. Confining sets and avoiding bottleneck cases: A simple maximum independent set algorithm in degree-3 graphs. Theoretical Computer Science, 469:92–104, 2013.

Appendix A Proofs

Let’s begin by re-stating our main optimization problem:

$$\min_{\mathbf{x}\in[0,1]^{n}}f(\mathbf{x}):=-\mathbf{e}_{n}^{T}\mathbf{x}+\frac{\gamma}{2}\mathbf{x}^{T}\mathbf{A}_{G}\mathbf{x}-\frac{1}{2}\mathbf{x}^{T}\mathbf{A}_{G^{\prime}}\mathbf{x}. \tag{7}$$

The gradient of (7) is:

$$\nabla_{\mathbf{x}}f(\mathbf{x})=-\mathbf{e}_{n}+(\gamma\mathbf{A}_{G}-\mathbf{A}_{G^{\prime}})\mathbf{x}. \tag{8}$$

For some $v\in V$ , we have

$$\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}_{v}}=-1+\gamma\sum_{u\in\mathcal{N}(v)}\mathbf{x}_{u}-\sum_{u\in\mathcal{N}^{\prime}(v)}\mathbf{x}_{u}. \tag{9}$$
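The gradient expression can be checked numerically. The sketch below, which is an illustration and not part of our code, evaluates the objective in (7) and the gradient in (8) on a small graph and compares the latter against central finite differences. The graph (a 4-cycle), the test point, and the value $\gamma=4$ are arbitrary choices, and all function names are hypothetical.

```python
def objective(x, A, A_c, gamma):
    """f(x) = -sum(x) + (gamma/2) x^T A x - (1/2) x^T A_c x, as in (7)."""
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    quad_c = sum(x[i] * A_c[i][j] * x[j] for i in range(n) for j in range(n))
    return -sum(x) + 0.5 * gamma * quad - 0.5 * quad_c


def gradient(x, A, A_c, gamma):
    """Entry v is -1 + ((gamma * A - A_c) x)_v, as in (8) and (9)."""
    n = len(x)
    return [-1.0 + sum((gamma * A[v][u] - A_c[v][u]) * x[u] for u in range(n))
            for v in range(n)]


def finite_difference(x, A, A_c, gamma, h=1e-6):
    """Central-difference approximation of the gradient, for comparison."""
    approx = []
    for v in range(len(x)):
        up = list(x)
        up[v] += h
        down = list(x)
        down[v] -= h
        approx.append((objective(up, A, A_c, gamma)
                       - objective(down, A, A_c, gamma)) / (2 * h))
    return approx


# 4-cycle 0-1-2-3-0; the complement holds the two diagonals (0,2) and (1,3).
A = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
A_c = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
x = [0.2, 0.7, 0.4, 0.9]

exact = gradient(x, A, A_c, gamma=4.0)
approx = finite_difference(x, A, A_c, gamma=4.0)
max_gap = max(abs(a - b) for a, b in zip(exact, approx))
print(max_gap < 1e-6)  # True
```

Because $f$ is quadratic, the central difference agrees with (8) up to floating-point rounding, so the gap is far below the tolerance used here.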

A.1 Proof of Theorem 4

Proof.

Let $S$ be an MIS. Define the vector $\mathbf{x}^{S}$ such that it contains $1$ ’s at positions corresponding to the nodes in the set $S$ , and $0$ ’s at all other positions. For any MIS to be a local minimizer of Problem (6), it is sufficient and necessary to require that

$$\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}_{v}}\geq 0,\quad\forall v\notin S\textrm{ and} \tag{10}$$
$$\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}_{v}}\leq 0,\quad\forall v\in S. \tag{11}$$

Here, $\mathbf{x}_{v}$ is the element of $\mathbf{x}$ at the position corresponding to the node $v$ . (10) is derived because if $v\notin S$ , then $\mathbf{x}^{S}_{v}=0$ (by the definition of $\mathbf{x}^{S}$ ) so it is at the left boundary of the interval $[0,1]$ . For the left boundary point to be a local minimizer, it requires the derivative to be non-negative (i.e., moving towards the right only increases the objective). Similarly, when $v\in S$ , $\mathbf{x}^{S}_{v}=1$ , is at the right boundary for (11), at which the derivative should be non-positive.

The derivative of $f$ computed in (9) can be rewritten as

$$\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}_{v}}=-1+\gamma m_{v}-\ell_{v},\quad\forall v\notin S, \tag{12}$$

where $m_{v}:=\left|\{u\in\mathcal{N}(v)\cap S\}\right|$ is the number of neighbours of $v$ in $S$ and $\ell_{v}$ is the number of non-neighbours of $v$ in $S$ , i.e., $\ell_{v}:=\left|\{u\in\mathcal{N}^{\prime}(v)\cap S\}\right|$ where $\mathcal{N}^{\prime}(v)=\{u:(u,v)\in E^{\prime}\}$ . By this definition, we immediately have $1\leq m_{v}\leq|S|$ and $0\leq\ell_{v}\leq|S|$ , where the upper and lower bounds for $m_{v}$ and $\ell_{v}$ are all attainable by some special graphs. Note that the lower bound of $m_{v}$ is $1$ , which is due to the fact that $S$ is an MIS, so any other node (say $v$ ) has at least $1$ edge connected to a node in $S$ .

Plugging (12) into (10), we obtain

$$\gamma\geq\frac{1+\ell_{v}}{m_{v}}. \tag{13}$$

Since we’re seeking a universal $\gamma$ for all the graphs, we must set $m_{v}$ to its lowest possible value, $1$ , and $\ell_{v}$ to its highest possible value, $k$ (both are attainable by some graphs), and still require $\gamma$ to satisfy (13). This means it is necessary and sufficient to require $\gamma\geq k+1$ . In addition, (11) is satisfied unconditionally and therefore does not impose any extra condition on $\gamma$ . ∎
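For a specific graph and a specific MIS, the instance-level bound in (13) can be computed directly from the counts $m_{v}$ and $\ell_{v}$ . The sketch below is illustrative only; the function name `gamma_lower_bound` and the path-graph example are assumptions made for this illustration.

```python
def gamma_lower_bound(n, edges, S):
    """Smallest gamma satisfying (13) for every node outside the MIS S."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    S = set(S)
    bound = 0.0
    for v in range(n):
        if v in S:
            continue
        m_v = len(adj[v] & S)      # neighbours of v inside S
        ell_v = len(S - adj[v])    # members of S that are not adjacent to v
        if m_v == 0:
            raise ValueError("S is not maximal: node %d could be added" % v)
        bound = max(bound, (1 + ell_v) / m_v)
    return bound


# Path graph 0 - 1 - 2 - 3 with the MIS S = {0, 2}.
# Node 1 has m = 2, ell = 0; node 3 has m = 1, ell = 1.
bound = gamma_lower_bound(4, [(0, 1), (1, 2), (2, 3)], {0, 2})
print(bound)  # 2.0
```

Here node $3$ is the binding case, with one neighbour and one non-neighbour in $S$ . The universal condition in the proof is the worst case of this quantity over all graphs.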

A.2 Proof of Theorem 5

Lemma 6.

All local minimizers of Problem (7) are binary vectors.

Proof.

Let $\mathbf{x}^{*}$ be any local minimizer of (7). If all the coordinates of $\mathbf{x}^{*}$ are either 0 or 1, then $\mathbf{x}^{*}$ is binary and the proof is complete. Otherwise, at least one coordinate of $\mathbf{x}^{*}$ is in the interior $(0,1)$ , and we aim to prove by contradiction that this is not possible (i.e., such a non-binary $\mathbf{x}^{*}$ cannot exist as a minimizer). We assume the non-binary $\mathbf{x}^{*}$ exists, and denote the set of non-binary coordinates as

$$J:=\{j:\mathbf{x}^{*}_{j}\in(0,1)\}. \tag{14}$$

Since $\mathbf{x}^{*}$ is non-binary, $J\neq\emptyset$ . Since the objective function $f(\mathbf{x})$ of (7) is twice differentiable with respect to all $\mathbf{x}_{j}$ with $\mathbf{x}_{j}\in(0,1)$ , then a necessary condition for $\mathbf{x}^{*}$ to be a local minimizer is that

$$\nabla f(\mathbf{x}^{*})\big|_{J}=0,\quad\nabla^{2}f(\mathbf{x}^{*})\big|_{J}\succeq 0,$$

where $\nabla f(\mathbf{x}^{*})\big{|}_{J}$ is the vector $\nabla f(\mathbf{x}^{*})$ restricted to the index set $J$ , and $\nabla^{2}f(\mathbf{x}^{*})\big{|}_{J}$ is the matrix $\nabla^{2}f(\mathbf{x}^{*})$ whose row and column indices are both restricted to the set $J$ .

However, the second necessary condition $\nabla^{2}f(\mathbf{x}^{*})\big{|}_{J}\succeq 0$ cannot hold. If it did, then we must have $\mathrm{tr}(\nabla^{2}f(\mathbf{x}^{*})\big{|}_{J})>0$ (the trace cannot be exactly 0 as $\nabla^{2}f(\mathbf{x}^{*})\big{|}_{J}=\gamma\mathbf{A}_{G}-\mathbf{A}_{G^{\prime}}\neq 0$ ). On the other hand, we have

$$\mathrm{tr}(\nabla^{2}f(\mathbf{x}^{*})\big|_{J})=\mathrm{tr}(\mathbf{I}_{J}(\gamma\mathbf{A}_{G}-\mathbf{A}_{G^{\prime}})\mathbf{I}_{J}^{T})=0$$

as the diagonal entries of $\mathbf{A}_{G}$ and $\mathbf{A}_{G^{\prime}}$ are all 0, which leads to a contradiction. Here $\mathbf{I}_{J}$ denotes the identity matrix with row indices restricted to the index set $J$ . ∎

Theorem 7 (Re-statement of Theorem 5).

Given graph $G=(V,E)$ and set $\gamma\geq n$ , all local minimizers of (5) correspond to an MIS in $G$ .

Proof.

By Lemma 6, we only need to consider binary vectors as local minimizers. With this, we first prove that all local minimizers are Independent Sets (ISs). Then, we show that any IS that is not a maximal IS is not a local minimizer.

•

Here, we show that any local minimizer is an IS. By contradiction, assume that a binary vector $\mathbf{x}$ with $\mathbf{x}_{v}=\mathbf{x}_{w}=1$ for some $(v,w)\in E$ (i.e., a binary vector that contains an edge of $G$ ) is a local minimizer. Since $\mathbf{x}_{v}=1$ is at the right boundary of the interval $[0,1]$ , for it to be a local minimizer, we must have $\frac{\partial f}{\partial\mathbf{x}_{v}}\leq 0$ . Together with (9), this implies

$$-1+\gamma\sum_{u\in\mathcal{N}(v)}\mathbf{x}_{u}-\sum_{u\in\mathcal{N}^{\prime}(v)}\mathbf{x}_{u}\leq 0. \tag{15}$$

Re-arranging (15) and using $\gamma\geq n$ yields

$$n\sum_{u\in\mathcal{N}(v)}\mathbf{x}_{u}\leq 1+\sum_{u\in\mathcal{N}^{\prime}(v)}\mathbf{x}_{u}. \tag{16}$$

Given that $n>\Delta(G^{\prime})$ , the condition in (16) cannot be satisfied even if the LHS attains its minimum value (which is $n$ ) and the RHS attains its maximum value. The maximum possible value of the RHS is $1+\mathrm{d}^{\prime}(v)=n-\mathrm{d}(v)$ , where $\mathrm{d}^{\prime}(v)$ is the degree of node $v$ in $G^{\prime}$ , and the maximum possible value of $\mathrm{d}^{\prime}(v)$ is $\Delta(G^{\prime})$ . This means that when an edge exists in $\mathbf{x}$ , it cannot be a local minimizer. Thus, only ISs are local minimizers.

•

Here, we show that Independent Sets that are not maximal are not local minimizers. Define vector $\mathbf{x}\in\{0,1\}^{n}$ that corresponds to an IS $\mathcal{I}(\mathbf{x})$ that is not maximal. This means that there exists a node $u\in V$ that is not in the IS and is not a neighbor of any node in the IS. Formally, there exists $u\notin\mathcal{I}(\mathbf{x})$ such that $\forall w\in\mathcal{I}(\mathbf{x}),u\notin\mathcal{N}(w)$ . Note that such an $\mathbf{x}$ satisfies $\mathbf{x}_{u}=0$ and

$$\frac{\partial f}{\partial\mathbf{x}_{u}}=-1+\gamma\sum_{w\in\mathcal{N}(u)}\mathbf{x}_{w}-\sum_{w\in\mathcal{N}^{\prime}(u)}\mathbf{x}_{w}=-1-\sum_{w\in\mathcal{N}^{\prime}(u)}\mathbf{x}_{w}<0, \tag{17}$$

which implies that increasing $\mathbf{x}_{u}$ can further decrease the function value, contradicting $\mathbf{x}$ being a local minimizer. In (17), the first summation is $0$ as $\mathcal{N}(u)\cap\mathcal{I}(\mathbf{x})=\emptyset$ , which results in $-(1+\sum_{w\in\mathcal{N}^{\prime}(u)}\mathbf{x}_{w})$ , which is always negative. Thus, any binary vector that corresponds to an IS that is not maximal is not a local minimizer.

∎

Remark 5.

The above theorem implies that although there still exist non-binary stationary points, they are saddle points instead of local minimizers. Adaptive and momentum-based methods such as AdaGrad and Adam can usually break out of saddle points and land on local minimizers.

Appendix B Further Implementation Details

In Algorithm 1, we use $S$ as the set of initializations we would like to solve. To solve this set of initializations, we choose batch size $K$ and number of batches $B$ , such that $|S|=KB$ . If we choose a large batch size $K$ , then we increase the time to first solution. Conversely, if we choose a large number of batches $B$ , then we decrease the number of initializations explored in a given batch, and potentially delay better solutions. Because of this relationship, $K$ and $B$ need to be chosen carefully. Our $K$ is 128 for SATLIB, 256 for ER, and 1024 for GNM. Our $B$ is 40 for SATLIB, 28 for ER, 10 for the GNM convergence results, and 5 for the GNM scalability results.

The results of $\mathsf{Quant}$ - $\mathsf{Net}$ and the baselines were obtained across three different machine configurations, with the runtime of the fastest configuration being reported. The NVIDIA A100 80GB PCIe machine utilizes an AMD EPYC 9554 CPU (30 vCores) with 236 GB of DDR5-4800. The NVIDIA RTX4090 24GB machine utilizes an AMD EPYC 75F3 CPU (16 vCores) with 64 GB of DDR4-2800. The NVIDIA RTX3070 8GB machine utilizes an Intel i9 12900K with 64 GB of DDR5-6000. Note: the i9 has Intel Hyper-Threading and E-Cores disabled to maximize single core performance.

B.1 Efficient Implementation of MIS Checking

Based on the characteristics of the local minimizers of Problem (5), discussed in Lemma 6 and Theorem 5, we propose an efficient implementation to check whether a vector $\mathbf{x}\in[0,1]^{n}$ corresponds to an MIS. This check corresponds to Line 7 in Algorithm 1.

We note that we need to check two successive conditions. The first is whether a binary vector corresponds to an IS (no two nodes in the set share an edge), and the second is whether this IS is maximal.

Given a vector $\mathbf{x}\in[0,1]^{n}$ , we obtain a binary representation of $\mathbf{x}$ , denoted by $\mathbf{z}\in\{0,1\}^{n}$ , such that for all $v\in V$ , $\mathbf{z}_{v}=1$ if $\mathbf{x}_{v}>0$ , and $\mathbf{z}_{v}=0$ otherwise.

Based on the results of Appendix A, for some $\alpha>0$ , our MIS checking involves verifying whether the following equality holds:

$$\mathbf{z}=\mathrm{Proj}_{[0,1]^{n}}\Big(\mathbf{z}-\alpha\nabla_{\mathbf{x}}f(\mathbf{z})\Big)=\mathrm{Proj}_{[0,1]^{n}}\Big(\mathbf{z}+\alpha\mathbf{e}_{n}-\alpha(\gamma\mathbf{A}_{G}-\mathbf{A}_{G^{\prime}})\mathbf{z}\Big). \tag{18}$$

Equation (18) applies a single projected gradient descent step to check whether $\mathbf{z}$ is a fixed point on the boundary. We note that, computationally, this only requires a matrix-vector multiplication. Compared to the traditional method, which iterates over all the nodes in the MIS to check their neighbours, using (18) is $8\times$ faster.
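The logic of this check can be sketched in a few lines. The code below is a dependency-free illustration, not our batched GPU implementation, so it does not reflect the reported speedup. It tests whether a binary vector is unchanged by one projected gradient step and compares the outcome with a direct neighbour-by-neighbour check. The function names, the 5-cycle example, and the choices $\gamma=n$ and $\alpha=1$ are assumptions made for illustration.

```python
from itertools import product


def is_mis_fixed_point(z, A, gamma, alpha=1.0):
    """Return True if z is unchanged by one projected gradient step on f."""
    n = len(z)
    for v in range(n):
        # ((gamma * A_G - A_G') z)_v, using A_G'[v][u] = 1 - A[v][u] for u != v.
        hz = sum((gamma * A[v][u] - (1 - A[v][u])) * z[u]
                 for u in range(n) if u != v)
        step = z[v] - alpha * (-1.0 + hz)
        if min(1.0, max(0.0, step)) != z[v]:
            return False
    return True


def is_mis_naive(z, A):
    """Reference check: the set is independent and no outside node can be added."""
    n = len(z)
    chosen = [v for v in range(n) if z[v] == 1]
    if any(A[u][v] for u in chosen for v in chosen):
        return False
    return all(z[v] == 1 or any(A[v][u] for u in chosen) for v in range(n))


# 5-cycle 0-1-2-3-4-0.
A = [[0, 1, 0, 0, 1],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [1, 0, 0, 1, 0]]
gamma = 5  # gamma >= n, as required by Theorem 5

found = [z for z in product([0, 1], repeat=5) if is_mis_fixed_point(z, A, gamma)]
print(len(found))  # 5
```

On this graph the fixed-point test accepts exactly the five maximal independent sets of the 5-cycle and agrees with the direct check on all $2^{5}$ binary vectors.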

Appendix C Requirements Comparison with Baselines

In Table 2, we provide an overview comparison of the number of trainable parameters, hyper-parameters, and additional techniques needed for each baseline.

ReduMIS depends on a large set of graph reductions (see Section 3.1 in [3]) and graph clustering, which is used for solution improvement.

For learning-based methods, the parameters of a neural network architecture are optimized during training. The number of parameters in this architecture is typically much larger than the number of input coordinates ( $\gg n$ ). For instance, the network used in DIFUSCO consists of 12 layers, each with 5 trainable weight matrices. Each weight matrix is $256\times 256$ , resulting in $3,932,160$ trainable parameters for the SATLIB dataset (which has at most $1,347$ nodes).
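The parameter count quoted above follows from simple arithmetic, reproduced here for convenience. The variable names are illustrative, and the count covers only the weight matrices mentioned in the text.

```python
layers = 12                # layers in the network
matrices_per_layer = 5     # trainable weight matrices per layer
width = 256                # each weight matrix is width x width

trainable = layers * matrices_per_layer * width * width
max_nodes = 1347           # largest SATLIB graph, i.e. the largest n

print(trainable)               # 3932160
print(trainable // max_nodes)  # 2919
```

In other words, the learned model carries roughly 2,900 times more trainable parameters than the $n$ variables needed for the largest SATLIB instance.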

Moreover, this dependence on training a NN introduces several hyper-parameters such as the number of layers, size of layers, choice of activation functions, etc.

It’s important to note that the choice of the sampler in iSCO introduces additional hyper-parameters. For instance, the PAS sampler [35] used in iSCO depends on the choice of the neighborhood function, a prior on the path length, and the choice of the probability of acceptance.

In terms of the number of optimization variables, $\mathsf{Quant}$ - $\mathsf{Net}$ only requires $n$ variables and a much-reduced number of hyper-parameters compared to iSCO.

Appendix D Additional Results

D.1 Convergence Plots with Fixed Run-time

In this section, we conduct an additional experiment to highlight the effectiveness of our proposed approach. Specifically, we use four ER (resp. GNM) graphs with $n=700$ (resp. $n=300$ ) and run our method and each baseline for a fixed run time of 14 (resp. 12) seconds. We then show the progress of the best obtained MIS over time. Figure 3 and Figure 4 present the results.

As observed, our method finds very good solutions early in the optimization process. For ER, within the 14-second time budget, we outperform the ILP commercial solvers in almost all cases. Additionally, ReduMIS takes 4 to 7 seconds to generate the first solution, whereas $\mathsf{Quant}$ - $\mathsf{Net}$ produces a good solution within the first second. For GNM, most methods reach the 12-node mark. However, our method reaches the 12-node solution within the first two seconds.

These convergence plots provide additional evidence of the scalability of $\mathsf{Quant}$ - $\mathsf{Net}$ .

Table 2: Requirements comparison with baselines. For the ILPs (Gurobi and CP-SAT), trainable parameters correspond to $n$ binary decision variables. ReduMIS is not an optimization method. However, it uses $n$ binary variables, one for each node.

Method	Size	Hyper-Parameters	Additional Techniques/Procedures
ReduMIS	$n$ variables	N/A	Many graph reductions, and graph clustering
Gurobi	$n$ variables	N/A	N/A
CP-SAT	$n$ variables	N/A	N/A
GCN	$\gg n$ trainable parameters	Many as it is learning-based	Tree Search
LwD	$\gg n$ trainable parameters	Many as it is learning-based	Entropy Regularization
DIMES	$\gg n$ trainable parameters	Many as it is learning-based	Tree Search or Sampling Decoding
DIFUSCO	$\gg n$ trainable parameters	Many as it is learning-based	Greedy Decoding or Sampling Decoding
iSCO	$n$ variables	Temperature, Sampler, Chain length	Post Processing for Correction
$\mathsf{Quant}$ - $\mathsf{Net}$	$n$ trainable parameters	Learning rate, exploration parameter $\eta$ , number of steps $T$	Optional SDP initialization

D.2 Impact of the Complement Graph term in $\mathsf{Quant}$ - $\mathsf{Net}$

In this subsection, we demonstrate the impact of incorporating the proposed complement graph term. Specifically, we use three GNM graphs with $(n,m)=(100,2475)$ and run Algorithm 1 for $10,000$ iterations, both with ( $\gamma=n$ ) and without ( $\gamma=1.0001$ , similar to iSCO [34]) the complement graph term. Each time a solution is found, we draw a new initialization from the uniform distribution and continue optimizing with Adam until the $10,000$ iterations are complete. In both cases, we use an initial learning rate of 0.5. The results are presented in Figure 5.

As shown in Figure 5, when the third (complement graph) term is included, our algorithm finds larger MISs while requiring fewer iterations (the first three plots). The fourth plot illustrates the number of MISs found (which may not be unique) with and without the third term across the three graph instances (x-axis). Including the third term results in finding more than 100 solutions, whereas disabling it yields fewer than 5 solutions within the 10,000 iterations. This indicates that, given one initialization, the third term significantly accelerates the optimizer's convergence to a local minimizer. Furthermore, faster convergence means that more initializations in the search space are tried within the same iteration budget, which improves exploration.
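To make the "with" and "without" settings concrete, the sketch below evaluates a three-term quadratic objective and its gradient on a toy graph. It assumes Problem (5) has the form $f(\mathbf{x}) = -\sum_{v} x_v + \gamma \sum_{(u,v)\in E} x_u x_v - \sum_{(u,v)\in E'} x_u x_v$, where $E'$ is the edge set of the complement graph; that form is an assumption here, since the objective is not restated in this appendix. The function name `objective_and_gradient`, the flag `use_complement`, and the 3-node path graph are hypothetical, and the code only evaluates $f$ and its gradient; it does not reproduce Algorithm 1 or the Adam updates.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def objective_and_gradient(
    x: Sequence[float],
    edges: Sequence[Tuple[int, int]],
    gamma: float,
    use_complement: bool = True,
) -> Tuple[float, List[float]]:
    """Evaluate the assumed objective and its gradient at x.

    Assumed form: f(x) = -sum(x) + gamma * sum over edges of G
    minus the sum over edges of the complement graph of x_u * x_v.
    """
    n = len(x)
    edge_set = {frozenset(e) for e in edges}
    value = -float(sum(x))
    grad = [-1.0] * n
    for u, v in combinations(range(n), 2):
        if frozenset((u, v)) in edge_set:
            weight = gamma   # penalty for selecting both endpoints of an edge of G
        elif use_complement:
            weight = -1.0    # reward for selecting both endpoints of a complement edge
        else:
            continue         # third term disabled
        value += weight * x[u] * x[v]
        grad[u] += weight * x[v]
        grad[v] += weight * x[u]
    return value, grad


# Toy example: path graph 0-1-2, whose largest independent set is {0, 2}.
path_edges = [(0, 1), (1, 2)]
x_mis = [1.0, 0.0, 1.0]
with_term = objective_and_gradient(x_mis, path_edges, gamma=3.0)
without_term = objective_and_gradient(x_mis, path_edges, gamma=3.0, use_complement=False)
```

In this toy case both settings give a gradient that is negative on the selected nodes and positive on the unselected one, so the point is stationary under the box constraint $[0,1]^n$. With the third term the gradient on the selected nodes is $-2$ rather than $-1$, which is one way to picture the stronger pull toward a binary solution; the toy example is not evidence for the speed-up reported in Figure 5.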
