-
Optimization Over Trained Neural Networks: Taking a Relaxing Walk
Authors:
Jiatai Tong,
Junyang Cai,
Thiago Serra
Abstract:
Besides training, mathematical optimization is also used in deep learning to model and solve formulations over trained neural networks for purposes such as verification, compression, and optimization with learned constraints. However, solving these formulations soon becomes difficult as the network size grows due to the weak linear relaxation and dense constraint matrix. We have seen improvements in recent years with cutting plane algorithms, reformulations, and a heuristic based on Mixed-Integer Linear Programming (MILP). In this work, we propose a more scalable heuristic based on exploring global and local linear relaxations of the neural network model. Our heuristic is competitive with a state-of-the-art MILP solver and the prior heuristic while producing better solutions with increases in input, depth, and number of neurons.
Submitted 28 January, 2024; v1 submitted 7 January, 2024;
originally announced January 2024.
-
Computational Tradeoffs of Optimization-Based Bound Tightening in ReLU Networks
Authors:
Fabian Badilla,
Marcos Goycoolea,
Gonzalo Muñoz,
Thiago Serra
Abstract:
The use of Mixed-Integer Linear Programming (MILP) models to represent neural networks with Rectified Linear Unit (ReLU) activations has become increasingly widespread in the last decade. This has enabled the use of MILP technology to test, or stress, their behavior, to adversarially improve their training, and to embed them in optimization models leveraging their predictive power. Many of these MILP models rely on activation bounds, that is, bounds on the input values of each neuron. In this work, we explore the tradeoff between the tightness of these bounds and the computational effort of solving the resulting MILP models. We provide guidelines for implementing these models based on the impact of network structure, regularization, and rounding.
Submitted 30 January, 2024; v1 submitted 27 December, 2023;
originally announced December 2023.
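As an illustration of why activation bounds matter, here is one widely used big-M encoding of a single ReLU neuron $y = \max(0, w^\top x + b)$. It is a sketch under stated assumptions: the listing does not say which formulation the paper uses, and the symbols $L$, $U$, and $z$ are introduced here for illustration. It assumes known bounds $L \le w^\top x + b \le U$ with $L < 0 < U$ and a binary variable $z$ that indicates whether the neuron is active.

```latex
\begin{align*}
y &\ge w^\top x + b, \\
y &\le w^\top x + b - L\,(1 - z), \\
y &\le U z, \\
y &\ge 0, \quad z \in \{0, 1\}.
\end{align*}
```

With $z = 1$ the constraints force $y = w^\top x + b$, and with $z = 0$ they force $y = 0$. Tighter values of $L$ and $U$ generally give a stronger linear relaxation, which is the kind of tradeoff the abstract refers to.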
-
When Deep Learning Meets Polyhedral Theory: A Survey
Authors:
Joey Huchette,
Gonzalo Muñoz,
Thiago Serra,
Calvin Tsay
Abstract:
In the past decade, deep learning became the prevalent methodology for predictive modeling thanks to the remarkable accuracy of deep neural networks in tasks such as computer vision and natural language processing. Meanwhile, the structure of neural networks converged back to simpler representations based on piecewise constant and piecewise linear functions such as the Rectified Linear Unit (ReLU), which became the most commonly used type of activation function in neural networks. That made certain types of network structure, such as the typical fully-connected feedforward neural network, amenable to analysis through polyhedral theory and to the application of methodologies such as Linear Programming (LP) and Mixed-Integer Linear Programming (MILP) for a variety of purposes. In this paper, we survey the main topics emerging from this fast-paced area of work, which bring a fresh perspective to understanding neural networks in more detail as well as to applying linear optimization techniques to train, verify, and reduce the size of such networks.
Submitted 31 August, 2023; v1 submitted 29 April, 2023;
originally announced May 2023.
-
Getting Away with More Network Pruning: From Sparsity to Geometry and Linear Regions
Authors:
Junyang Cai,
Khai-Nguyen Nguyen,
Nishant Shrestha,
Aidan Good,
Ruisen Tu,
Xin Yu,
Shandian Zhe,
Thiago Serra
Abstract:
One surprising trait of neural networks is the extent to which their connections can be pruned with little to no effect on accuracy. But when we cross a critical level of parameter sparsity, pruning any further leads to a sudden drop in accuracy. This drop plausibly reflects a loss in model complexity, which we aim to avoid. In this work, we explore how sparsity also affects the geometry of the linear regions defined by a neural network, and consequently reduces the expected maximum number of linear regions based on the architecture. We observe that pruning affects accuracy similarly to how sparsity affects the number of linear regions and our proposed bound for the maximum number. Conversely, we find that selecting the sparsity across layers to maximize our bound very often improves accuracy in comparison to pruning as much with the same sparsity in all layers, thereby providing us guidance on where to prune.
Submitted 19 January, 2023;
originally announced January 2023.
-
Recall Distortion in Neural Network Pruning and the Undecayed Pruning Algorithm
Authors:
Aidan Good,
Jiaqi Lin,
Hannah Sieg,
Mikey Ferguson,
Xin Yu,
Shandian Zhe,
Jerzy Wieczorek,
Thiago Serra
Abstract:
Pruning techniques have been successfully used in neural networks to trade accuracy for sparsity. However, the impact of network pruning is not uniform: prior work has shown that the recall for underrepresented classes in a dataset may be more negatively affected. In this work, we study such relative distortions in recall by hypothesizing an intensification effect that is inherent to the model. Namely, that pruning makes recall relatively worse for a class with recall below accuracy and, conversely, that it makes recall relatively better for a class with recall above accuracy. In addition, we propose a new pruning algorithm aimed at attenuating such an effect. Through statistical analysis, we have observed that intensification is less severe with our algorithm but nevertheless more pronounced with relatively more difficult tasks, less complex models, and higher pruning ratios. More surprisingly, we conversely observe a de-intensification effect with lower pruning ratios, which indicates that moderate pruning may have a corrective effect on such distortions.
Submitted 12 November, 2022; v1 submitted 6 June, 2022;
originally announced June 2022.
-
Optimal Decision Diagrams for Classification
Authors:
Alexandre M. Florio,
Pedro Martins,
Maximilian Schiffer,
Thiago Serra,
Thibaut Vidal
Abstract:
Decision diagrams for classification have some notable advantages over decision trees, as their internal connections can be determined at training time and their width is not bound to grow exponentially with their depth. Accordingly, decision diagrams are usually less prone to data fragmentation in internal nodes. However, the inherent complexity of training these classifiers acted as a long-standing barrier to their widespread adoption. In this context, we study the training of optimal decision diagrams (ODDs) from a mathematical programming perspective. We introduce a novel mixed-integer linear programming model for training and demonstrate its applicability for many datasets of practical importance. Further, we show how this model can be easily extended for fairness, parsimony, and stability notions. We present numerical analyses showing that our model allows training ODDs in short computational times, and that ODDs achieve better accuracy than optimal decision trees, while allowing for improved stability without significant accuracy losses.
Submitted 28 May, 2022;
originally announced May 2022.
-
The Combinatorial Brain Surgeon: Pruning Weights That Cancel One Another in Neural Networks
Authors:
Xin Yu,
Thiago Serra,
Srikumar Ramalingam,
Shandian Zhe
Abstract:
Neural networks tend to achieve better accuracy with training if they are larger -- even if the resulting models are overparameterized. Nevertheless, carefully removing such excess parameters before, during, or after training may also produce models with similar or even improved accuracy. In many cases, that can be curiously achieved by heuristics as simple as removing a percentage of the weights with the smallest absolute value -- even though magnitude is not a perfect proxy for weight relevance. With the premise that obtaining significantly better performance from pruning depends on accounting for the combined effect of removing multiple weights, we revisit one of the classic approaches for impact-based pruning: the Optimal Brain Surgeon (OBS). We propose a tractable heuristic for solving the combinatorial extension of OBS, in which we select weights for simultaneous removal, as well as a systematic update of the remaining weights. Our selection method outperforms other methods under high sparsity, and the weight update is advantageous even when combined with the other methods.
Submitted 19 June, 2022; v1 submitted 8 March, 2022;
originally announced March 2022.
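The simple baseline that the abstract above mentions, removing a percentage of the weights with the smallest absolute value, can be sketched as follows. This is a minimal illustration of magnitude pruning only, not the paper's OBS-based method; the function name `magnitude_prune` and the example matrix are hypothetical.

```python
from typing import List


def magnitude_prune(weights: List[List[float]], fraction: float) -> List[List[float]]:
    """Return a copy of `weights` with the smallest-magnitude entries set to zero.

    `fraction` is the share of entries to remove, rounded down to a whole count.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Collect (magnitude, row, column) for every entry and sort by magnitude.
    entries = [
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    ]
    entries.sort()
    n_remove = int(fraction * len(entries))
    pruned = [list(row) for row in weights]
    for _, r, c in entries[:n_remove]:
        pruned[r][c] = 0.0
    return pruned


# Hypothetical 2 x 3 weight matrix; prune half of its six entries.
W = [[0.9, -0.05, 0.3], [-0.7, 0.01, -0.2]]
pruned = magnitude_prune(W, 0.5)
print(pruned)
```

Because each weight is judged in isolation, this baseline ignores how removed weights interact, which is the limitation the abstract sets out to address.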
-
Training Thinner and Deeper Neural Networks: Jumpstart Regularization
Authors:
Carles Riera,
Camilo Rey,
Thiago Serra,
Eloi Puertas,
Oriol Pujol
Abstract:
Neural networks are more expressive when they have multiple layers. In turn, conventional training methods are only successful if the depth does not lead to numerical issues such as exploding or vanishing gradients, which occur less frequently when the layers are sufficiently wide. However, increasing width to attain greater depth entails the use of heavier computational resources and leads to overparameterized models. These subsequent issues have been partially addressed by model compression methods such as quantization and pruning, some of which rely on normalization-based regularization of the loss function to make the effect of most parameters negligible. In this work, we propose instead to use regularization for preventing neurons from dying or becoming linear, a technique which we denote as jumpstart regularization. In comparison to conventional training, we obtain neural networks that are thinner, deeper, and, most importantly, more parameter-efficient.
Submitted 5 June, 2022; v1 submitted 30 January, 2022;
originally announced January 2022.
-
Scaling Up Exact Neural Network Compression by ReLU Stability
Authors:
Thiago Serra,
Xin Yu,
Abhinav Kumar,
Srikumar Ramalingam
Abstract:
We can compress a rectifier network while exactly preserving its underlying functionality with respect to a given input domain if some of its neurons are stable. However, current approaches to determine the stability of neurons with Rectified Linear Unit (ReLU) activations require solving or finding a good approximation to multiple discrete optimization problems. In this work, we introduce an algorithm based on solving a single optimization problem to identify all stable neurons. Our approach is on median 183 times faster than the state-of-the-art method on CIFAR-10, which allows us to explore exact compression on deeper (5 x 100) and wider (2 x 800) networks within minutes. For classifiers trained under an amount of L1 regularization that does not worsen accuracy, we can remove up to 56% of the connections on the CIFAR-10 dataset. The code is available at https://github.com/yuxwind/ExactCompression.
Submitted 28 October, 2021; v1 submitted 15 February, 2021;
originally announced February 2021.
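To make the notion of a stable neuron in the abstract above concrete, the sketch below uses plain interval arithmetic over a box input domain to flag first-layer neurons whose pre-activation never changes sign. This is only a cheap sufficient check and is not the single-optimization-problem algorithm the paper proposes; the function names, weights, and input box are hypothetical.

```python
from typing import List, Tuple


def preactivation_bounds(
    weights: List[List[float]],
    biases: List[float],
    lower: List[float],
    upper: List[float],
) -> List[Tuple[float, float]]:
    """Interval bounds on w.x + b per neuron when lower[i] <= x[i] <= upper[i]."""
    bounds = []
    for w_row, b in zip(weights, biases):
        lo = hi = b
        for w, l, u in zip(w_row, lower, upper):
            # A positive weight attains its minimum at the lower input bound,
            # a negative weight at the upper input bound.
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        bounds.append((lo, hi))
    return bounds


def classify_stability(bounds: List[Tuple[float, float]]) -> List[str]:
    """Label each neuron by the sign behavior of its pre-activation interval."""
    labels = []
    for lo, hi in bounds:
        if lo >= 0:
            labels.append("stably active")
        elif hi <= 0:
            labels.append("stably inactive")
        else:
            labels.append("unstable")
    return labels


# Hypothetical layer with three neurons and inputs in the unit box [0, 1]^2.
W = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
b = [0.5, -0.5, 0.0]
bounds = preactivation_bounds(W, b, [0.0, 0.0], [1.0, 1.0])
labels = classify_stability(bounds)
print(list(zip(bounds, labels)))
```

A stably inactive neuron always outputs zero and a stably active one behaves linearly, which is why such neurons can be removed or merged without changing the network's output on the domain. Interval bounds are loose in deeper layers, so a neuron labeled unstable here may still be stable.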
-
Lossless Compression of Deep Neural Networks
Authors:
Thiago Serra,
Abhinav Kumar,
Srikumar Ramalingam
Abstract:
Deep neural networks have been successful in many predictive modeling tasks, such as image and language recognition, where large neural networks are often used to obtain good accuracy. Consequently, it is challenging to deploy these networks under limited computational resources, such as in mobile devices. In this work, we introduce an algorithm that removes units and layers of a neural network while not changing the output that is produced, which thus implies a lossless compression. This algorithm, which we denote as LEO (Lossless Expressiveness Optimization), relies on Mixed-Integer Linear Programming (MILP) to identify Rectified Linear Units (ReLUs) with linear behavior over the input domain. By using L1 regularization to induce such behavior, we can benefit from training over a larger architecture than we would later use in the environment where the trained neural network is deployed.
Submitted 22 February, 2020; v1 submitted 1 January, 2020;
originally announced January 2020.
-
Template-based Minor Embedding for Adiabatic Quantum Optimization
Authors:
Thiago Serra,
Teng Huang,
Arvind Raghunathan,
David Bergman
Abstract:
Quantum Annealing (QA) can be used to quickly obtain near-optimal solutions for Quadratic Unconstrained Binary Optimization (QUBO) problems. In QA hardware, each decision variable of a QUBO should be mapped to one or more adjacent qubits in such a way that pairs of variables defining a quadratic term in the objective function are mapped to some pair of adjacent qubits. However, qubits have limited connectivity in existing QA hardware. This has spurred work on preprocessing algorithms for embedding the graph representing problem variables with quadratic terms into the hardware graph representing qubit adjacencies, such as the Chimera graph in hardware produced by D-Wave Systems. In this paper, we use integer linear programming to search for an embedding of the problem graph into certain classes of minors of the Chimera graph, which we call template embeddings. One of these classes corresponds to complete bipartite graphs, for which we show the limitation of the existing approach based on minimum Odd Cycle Transversals (OCTs). One of the formulations presented is exact, and thus can be used to certify the absence of a minor embedding using that template. On an extensive test set consisting of random graphs from five different classes of varying size and sparsity, we can embed more graphs than a state-of-the-art OCT-based approach, our approach scales better with the hardware size, and the runtime is generally orders of magnitude smaller.
Submitted 19 January, 2021; v1 submitted 4 October, 2019;
originally announced October 2019.
-
Equivalent and Approximate Transformations of Deep Neural Networks
Authors:
Abhinav Kumar,
Thiago Serra,
Srikumar Ramalingam
Abstract:
Two networks are equivalent if they produce the same output for any given input. In this paper, we study the possibility of transforming a deep neural network to another network with a different number of units or layers, which can be either equivalent, a local exact approximation, or a global linear approximation of the original network. On the practical side, we show that certain rectified linear units (ReLUs) can be safely removed from a network if they are always active or inactive for any valid input. If we only need an equivalent network for a smaller domain, then more units can be removed and some layers collapsed. On the theoretical side, we constructively show that for any feed-forward ReLU network, there exists a global linear approximation to a 2-hidden-layer shallow network with a fixed number of units. This result is a balance between the increasing number of units for arbitrary approximation with a single layer and the known upper bound of $\lceil \log(n_0+1)\rceil + 1$ layers for exact representation, where $n_0$ is the input dimension. While the transformed network may require an exponential number of units to capture the activation patterns of the original network, we show that it can be made substantially smaller by only accounting for the patterns that define linear regions. Based on experiments with ReLU networks on the MNIST dataset, we found that $l_1$-regularization and adversarial training reduce the number of linear regions significantly as the number of stable units increases due to weight sparsity. Therefore, we can also intentionally train ReLU networks to allow for effective lossless compression and approximation.
Submitted 27 May, 2019;
originally announced May 2019.
-
Empirical Bounds on Linear Regions of Deep Rectifier Networks
Authors:
Thiago Serra,
Srikumar Ramalingam
Abstract:
We can compare the expressiveness of neural networks that use rectified linear units (ReLUs) by the number of linear regions, which reflects the number of pieces of the piecewise linear functions modeled by such networks. However, enumerating these regions is prohibitive and the known analytical bounds are identical for networks with the same dimensions. In this work, we approximate the number of linear regions through empirical bounds based on features of the trained network and probabilistic inference. Our first contribution is a method to sample the activation patterns defined by ReLUs using universal hash functions. This method is based on a Mixed-Integer Linear Programming (MILP) formulation of the network and an algorithm for probabilistic lower bounds of MILP solution sets that we call MIPBound, which is considerably faster than exact counting and reaches values in similar orders of magnitude. Our second contribution is a tighter activation-based bound for the maximum number of linear regions, which is particularly stronger in networks with narrow layers. Combined, these bounds yield a fast proxy for the number of linear regions of a deep neural network.
Submitted 14 December, 2019; v1 submitted 8 October, 2018;
originally announced October 2018.
-
When Lift-and-Project Cuts are Different
Authors:
Egon Balas,
Thiago Serra
Abstract:
In this paper, we present a method to determine if a lift-and-project cut for a mixed-integer linear program is irregular, in which case the cut is not equivalent to any intersection cut from the bases of the linear relaxation. This is an important question due to the intense research activity for the past decade on cuts from multiple rows of the simplex tableau as well as on lift-and-project cuts from non-split disjunctions. While it is known since Balas and Perregaard (2003) that lift-and-project cuts from split disjunctions are always equivalent to intersection cuts and consequently to such multi-row cuts, Balas and Kis (2016) have recently shown that there is a necessary and sufficient condition in the case of arbitrary disjunctions: a lift-and-project cut is regular if, and only if, it corresponds to a regular basic solution of the Cut Generating Linear Program (CGLP). This paper has four contributions. First, we state a result that simplifies the verification of regularity for basic CGLP solutions from Balas and Kis (2016). Second, we provide a mixed-integer formulation that checks whether there is a regular CGLP solution for a given cut that is regular in a broader sense, which also encompasses irregular cuts that are implied by the regular cut closure. Third, we describe a numerical procedure based on this formulation that identifies irregular lift-and-project cuts. Finally, we use this method to evaluate how often lift-and-project cuts from simple $t$-branch split disjunctions are irregular, and thus not equivalent to multi-row cuts, on 74 instances of the MIPLIB benchmarks.
Submitted 24 January, 2020; v1 submitted 15 September, 2018;
originally announced September 2018.
-
How Could Polyhedral Theory Harness Deep Learning?
Authors:
Thiago Serra,
Christian Tjandraatmadja,
Srikumar Ramalingam
Abstract:
The holy grail of deep learning is to come up with an automatic method to design optimal architectures for different applications. In other words, how can we effectively dimension and organize neurons along the network layers based on the computational resources, input size, and amount of training data? We outline promising research directions based on polyhedral theory and mixed-integer representability that may offer an analytical approach to this question, in contrast to the empirical techniques often employed.
Submitted 17 June, 2018;
originally announced June 2018.
-
Bounding and Counting Linear Regions of Deep Neural Networks
Authors:
Thiago Serra,
Christian Tjandraatmadja,
Srikumar Ramalingam
Abstract:
We investigate the complexity of deep neural networks (DNN) that represent piecewise linear (PWL) functions. In particular, we study the number of linear regions, i.e., pieces, that a PWL function represented by a DNN can attain, both theoretically and empirically. We present (i) tighter upper and lower bounds for the maximum number of linear regions on rectifier networks, which are exact for inputs of dimension one; (ii) a first upper bound for multi-layer maxout networks; and (iii) a first method to perform exact enumeration or counting of the number of regions by modeling the DNN with a mixed-integer linear formulation. These bounds come from leveraging the dimension of the space defining each linear region. The results also indicate that a deep rectifier network can only have more linear regions than every shallow counterpart with the same number of neurons if that number exceeds the dimension of the input.
Submitted 15 September, 2018; v1 submitted 6 November, 2017;
originally announced November 2017.
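As a rough illustration of what counting linear regions means in the abstract above, the sketch below counts the distinct ReLU activation patterns that a small network exhibits on a grid of one-dimensional inputs. It is a sampling-based lower estimate under stated assumptions, not the exact mixed-integer enumeration the paper describes; the function names and the toy network are hypothetical.

```python
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def activation_pattern(x: List[float], layers: List[Layer]) -> Tuple[int, ...]:
    """Return a 0/1 flag per hidden neuron, 1 when its pre-activation is positive."""
    pattern: List[int] = []
    h = x
    for weights, biases in layers:
        pre = [
            sum(w * v for w, v in zip(row, h)) + b
            for row, b in zip(weights, biases)
        ]
        pattern.extend(1 if p > 0 else 0 for p in pre)
        h = [max(0.0, p) for p in pre]  # ReLU output feeds the next layer
    return tuple(pattern)


def count_patterns_1d(layers: List[Layer], lo: float, hi: float, samples: int) -> int:
    """Count distinct activation patterns over evenly spaced inputs in [lo, hi]."""
    seen = set()
    for k in range(samples):
        x = lo + (hi - lo) * k / (samples - 1)
        seen.add(activation_pattern([x], layers))
    return len(seen)


# Hypothetical shallow network: two ReLUs with breakpoints at x = 0 and x = 1.
layers = [([[1.0], [1.0]], [0.0, -1.0])]
count = count_patterns_1d(layers, -2.0, 3.0, 101)
print(count)
```

In this toy case the two breakpoints split the real line into three pieces, and the grid is fine enough to visit all of them. In general a grid can miss narrow regions, so the count is only a lower estimate, and sampling becomes impractical as the input dimension grows.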