License: CC BY 4.0
arXiv:2401.11188v1 [cs.LG] 20 Jan 2024

Fast and Exact Enumeration of Deep Networks Partitions Regions

Abstract

One fruitful formulation of Deep Networks (DNs) enabling their theoretical study and providing practical guidelines to practitioners relies on Piecewise Affine Splines. In that realm, a DN’s input-output mapping is expressed as a per-region affine mapping, where those regions are implicitly determined by the model’s architecture and form a partition of its input space. That partition, which is involved in all the results stemming from this line of research, has so far only been computed on 2/3-dimensional slices of the DN’s input space or estimated by random sampling. In this paper, we provide the first parallel algorithm that performs exact enumeration of the DN’s partition regions. The proposed algorithm makes it possible to finally assess the accuracy of the commonly employed approximation methods, e.g. those based on random sampling of the DN input space. One of our key findings is that if one is only interested in regions with “large” volume, then uniform sampling of the space is highly efficient, but that if one is also interested in discovering the “small” regions of the partition, then the cost of uniform sampling grows exponentially with the DN’s input space dimension. On the other hand, our proposed method has complexity scaling linearly with the input dimension and the number of regions.

1 Introduction

Fig. 1: Proposed exact region enumeration, depicted as an orange star, against sampling-based region discovery of the partition $\Omega$, depicted as blue dots, for a single hidden layer DN with leaky-ReLU, random parameters and width $64$, as a function of computation time (x-axis) and number of partition regions found (y-axis); for a $4$-dimensional input space at the top and an $8$-dimensional input space at the bottom. The proposed Algorithm 1 is able to dramatically outperform the sampling-based search that has been used throughout recent studies on CPA DNs.

Deep Networks (DNs) are compositions of linear and nonlinear operators altogether forming a differentiable functional $f_{\bm{\theta}}$ governed by some trainable parameters $\bm{\theta}$ [1]. Understanding the underlying properties that make DNs the great function approximators that they are involves many different research directions, e.g. the underlying implicit regularization of architectures [2], or the impact of input and feature normalization on the optimization landscape [3]. Most existing results emerge from a few different mathematical formulations. One prominent example relies on kernels and emerges from pushing the DN’s layer widths to infinity. In this case, and under some additional assumptions, a closed-form expression for the DN’s underlying embedding space metric is obtained [4]. With that form, training dynamics and generalization bounds are amenable to theoretical analysis [5]. Another line of research considers the case of deep linear networks, i.e. DNs without nonlinearities. In this setting, it is possible to obtain the explicit regularizer that acts upon the DN’s functional and that depends on the specifics of the architecture, e.g. depth and width [6]. Another direction, most relevant to our study, proposes to unravel the Continuous Piecewise Affine (CPA) mapping of standard DNs [7]. In short, one can combine the facts that (i) the nonlinearities present in most current DNs are themselves CPA, e.g. (leaky-)ReLU, absolute value, max-pooling, (ii) the interleaved affine mappings preserve the CPA property, and (iii) compositions of CPA mappings remain CPA. Thus, the entire input-output DN mapping is itself CPA.
From that observation, it is possible to study the DN’s loss landscape [8], the implicit regularizer of different architectures [9], the explicit probabilistic distributions of CPA Deep Generative Networks [10, 11], the approximation rates [12, 13], or even the conditions for adversarial robustness guarantees [14, 15]. A striking benefit of the CPA viewpoint lies in the fact that it is an exact mathematical description of the DN’s input-output mapping without any approximation or simplification. This makes the obtained insights and guidelines highly relevant to improving currently deployed state-of-the-art architectures.

Despite this active recent development of CPA-based results around DNs, one key challenge remains open. Because under this view one expresses the DN mapping as a collection of affine mappings, one for each region $\omega$ of some partition $\Omega$ of its input space, it becomes crucial to compute that partition $\Omega$ or at least infer some statistics from it. Current analytical characterizations of $\Omega$ are in fact insufficient, e.g. existing bounds characterizing the number of regions in $\Omega$ are known to be loose and uninformative [16]. As such, practitioners resort to simple approximation strategies, e.g. sampling, to estimate such properties of $\Omega$. Another approach is to only consider 2/3-dimensional slices of the DN’s input space and estimate $\Omega$ restricted to that subspace. All in all, nothing is known yet about how accurately those approximations convey the underlying properties of the entire partition $\Omega$ that current theoretical results heavily rely on. In particular, [17] uses estimates of the partition’s number of regions to perform Neural Architecture Search (NAS), for which exact computation of the DN’s partition regions would further improve the NAS; [11] uses estimates of the partition to adapt the distribution of deep generative networks (e.g. variational autoencoders), for which exact computation of the partition would make their method exact rather than an approximation.

In this paper, we propose a principled and provable enumeration method for DN partitions (Algorithm 1) that we first develop for a layer-wise analysis in Section 2 and then extend to the multilayer case in Section 3. As depicted in Fig. 1, the proposed method becomes exponentially faster than the sampling-based strategy at discovering the regions $\omega \in \Omega$ as the input dimensionality increases. Practically, the proposed enumeration method makes it possible, for the first time, to measure the accuracy of the currently employed approximations. Our method is efficiently implemented in a few lines of code, leverages parallel computations, and provably enumerates all the regions of the DN’s partition. Lastly, our method has linear asymptotic complexity with respect to the number of regions and with respect to the DN’s input space dimension. This property is crucial since, as we will demonstrate, sampling-based enumeration has complexity growing exponentially with respect to the DN’s input space dimension as a direct consequence of the curse of dimensionality [18, 19]. We hope that our method will serve as the baseline algorithm for any application requiring provable partition region enumeration, or to assess the theoretical findings obtained from the CPA formulation of DNs.
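To make the curse-of-dimensionality argument concrete, the following minimal sketch computes how many uniform samples would be needed to hit one small region at least once. It rests on an illustrative assumption that is not taken from the experiments of this paper: the region is a cube whose side is a fraction `side` of the side of a cubic sampling domain, so that its relative volume is `side ** dim`. The function name `samples_needed` is hypothetical.

```python
import math


def samples_needed(side: float, dim: int, confidence: float = 0.95) -> float:
    """Uniform samples needed to hit a small cubic region at least once.

    Illustrative assumption: the region is a cube whose side is a fraction
    `side` of the side of the (cubic) sampling domain, so its relative
    volume is side ** dim.
    """
    p = side ** dim  # probability that one uniform sample lands in the region
    # Solve (1 - p) ** N = 1 - confidence for N.
    return math.log(1.0 - confidence) / math.log1p(-p)


if __name__ == "__main__":
    for dim in (2, 4, 8):
        print(f"D={dim}: about {samples_needed(0.1, dim):.3g} samples")
```

Under this assumption the required number of samples is roughly proportional to `side ** (-dim)`, i.e. it grows exponentially with the input dimension for a fixed relative side length.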

2 Enumeration of Single-Layer Partitions

We now develop the enumeration algorithm for a single DN layer. Because a DN recursively subdivides the per-layer partition, the single layer case will be enough to iteratively compute the partition of a multilayer DN as shown in the next Section 3.

2.1 Layer Partitions and Hyperplane Arrangements

We denote the input-output mapping of a single DN layer (we consider the first layer without loss of generality, since the exact same analysis applies to any layer in the DN when looking at the partition of its own input space) as $f_{\bm{\theta}}: \mathbb{R}^{D} \mapsto \mathbb{R}^{K}$, with $\bm{\theta}$ the parameters of the mapping. We also consider vectors as inputs without loss of generality since, when dealing with images, one can always flatten them into vectors and reparametrize the layer accordingly. The layer mapping takes the form

$$f_{\bm{\theta}}(\bm{x}) = \sigma(\bm{h}(\bm{x})) \text{ with } \bm{h}(\bm{x}) = \bm{W}\bm{x} + \bm{b} \tag{1}$$

where $\sigma$ is a pointwise activation function, $\bm{W}$ is a weight matrix of dimensions $K \times D$, $\bm{b}$ is a bias vector of length $K$, $\bm{h}(\bm{x})$ denotes the pre-activation map, and lastly $\bm{x}$ is some input from $\mathbb{R}^{D}$. The layer parameters are thus $\bm{\theta} \triangleq \{\bm{W}, \bm{b}\}$. Although simple, Eq. 1 encompasses most current DN layers by specifying the correct structural constraints on the matrix $\bm{W}$, e.g. to be circulant for a convolutional layer. The details of the layer mapping will not impact our results. The CPA view of DNs [20, 7] consists in expressing Eq. 1 as

$$f_{\bm{\theta}}(\bm{x}) = \sum_{\omega \in \Omega} (\bm{A}_{\omega}\bm{x} + \bm{b}_{\omega}) 1_{\{\bm{x} \in \omega\}}, \tag{2}$$

where $\Omega$ is the layer input space partition [21]. Understanding the form of $\Omega$ will greatly help the development of the enumeration algorithm in Section 2.2. Given nonlinearities $\sigma$ such as (leaky-)ReLU or absolute value, it is direct to see that the layer stays affine on a region $\omega$ as long as all the inputs within it have the same pre-activation signs. That is, a region is entirely and uniquely determined by those sign patterns

$$f_{\bm{\theta}} \text{ affine on } \omega \iff \operatorname{sign}(\bm{h}(\bm{x})) = \operatorname{sign}(\bm{h}(\bm{x}')), \forall (\bm{x}, \bm{x}') \in \omega^{2},$$

where the equality is to be understood elementwise on all of the $K$ entries of the sign vectors. The only exception arises for degenerate weights $\bm{W}$, which we do not consider since any arbitrarily small perturbation of such degeneracies removes those edge cases. From the above observation alone, it becomes clear that the transition between different regions of $\Omega$ must occur when the pre-activation sign of some unit $k \in \{1, \dots, K\}$ changes, and because $\bm{h}$ is nothing more than an affine mapping, this sign change for some unit $k$ can only occur when crossing the hyperplane

$$\mathbb{H}_{k} \triangleq \{\bm{x} \in \mathbb{R}^{D} : \langle \bm{W}_{k,.}, \bm{x} \rangle + \bm{b}_{k} = 0\}. \tag{3}$$

Leveraging Eq. 3 we obtain that $\partial\Omega$, the boundary of the layer’s partition, is a hyperplane arrangement, as in $\partial\Omega = \bigcup_{k=1}^{K} \mathbb{H}_{k}$.
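The correspondence between sign patterns and per-region affine maps can be illustrated with a short numerical sketch. It assumes a leaky-ReLU activation with slope `alpha = 0.1` and random Gaussian parameters; the helper names `sign_pattern` and `region_affine_map` are hypothetical and only serve this illustration.

```python
import numpy as np


def leaky_relu(h, alpha=0.1):
    """Pointwise leaky-ReLU (the slope alpha is an illustrative choice)."""
    return np.where(h > 0, h, alpha * h)


def sign_pattern(W, b, x):
    """Sign vector q in {-1, 1}^K of the pre-activations h(x) = W x + b."""
    return np.where(W @ x + b > 0, 1, -1)


def region_affine_map(W, b, q, alpha=0.1):
    """Affine parameters (A_w, b_w) of the layer on the region encoded by q."""
    slopes = np.where(q > 0, 1.0, alpha)  # slope applied by each unit
    return slopes[:, None] * W, slopes * b


rng = np.random.default_rng(0)
K, D = 6, 3
W, b = rng.standard_normal((K, D)), rng.standard_normal(K)
x = rng.standard_normal(D)

q = sign_pattern(W, b, x)
A_w, b_w = region_affine_map(W, b, q)
# On the region containing x, the layer output equals A_w x + b_w.
print(np.allclose(leaky_relu(W @ x + b), A_w @ x + b_w))
```

Any other input sharing the sign pattern `q` is mapped by the same pair `(A_w, b_w)`, which is the content of Eq. 2 for this choice of activation.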

We are now able to leverage this particular structure of the layer’s partition to present an enumeration algorithm that will recursively search for all the regions $\omega \in \Omega$.

2.2 Region Enumeration Algorithm

From the previous understanding that the layer’s partition arises from a hyperplane arrangement involving Eq. 3, we are now able to leverage and adapt existing enumeration methods for such partitions to obtain all the regions $\omega \in \Omega$, from which it will become straightforward to consider the multilayer case that we leave for Section 3.

Enumerating the regions of the partition of the layer $f_{\bm{\theta}}$ can be done efficiently by adapting existing reverse search algorithms [22] optimized for hyperplane arrangements. In fact, a naive approach of enumerating all of the $2^{K}$ possible sign patterns $\bm{q} \in \{-1, 1\}^{K}$ and checking if each defines a non-empty region

$$\bigcap_{k=1}^{K} \left\{\bm{x} \in \mathbb{R}^{D} : \left(\langle \bm{W}_{k,.}, \bm{x} \rangle + \bm{b}_{k}\right) \bm{q}_{k} \geq 0\right\} \overset{?}{=} \emptyset,$$

would be largely wasteful. In fact, most of such sign combinations produce empty regions, e.g. if the arrangement is central, i.e. the intersection of all the hyperplanes is not empty, then the total number of regions grows only polynomially with $K$ for a fixed input dimension $D$ (linearly when $D=2$) [23] and is thus much smaller than $2^{K}$. Instead, a much more efficient strategy is to only explore feasible sign patterns in a recursive tree-like structure. To do so, the algorithm recursively subdivides a parent region by the hyperplane of unit $k$. If that hyperplane does not intersect the current region, then we can skip unit $k$ and recurse on the subdivision of that same region by unit $k+1$. On the other hand, if hyperplane $k$ divides the current region, we consider both sides of it and keep the recursion going on both sides. We formally summarize the method in Algorithm 1 and present one illustrative example and comparison against sampling-based region enumeration in Fig. 1. In particular, we report the efficiency of the sampling solution for various configurations in Table 1.
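To quantify how far the number of non-empty regions falls below $2^{K}$, the sketch below evaluates the classical region-counting formulas for hyperplanes in general position. These formulas are standard results on hyperplane arrangements rather than something derived in this paper, and they assume general position, which random weights satisfy with probability one. The function names are hypothetical.

```python
from math import comb


def regions_general_position(K: int, D: int) -> int:
    """Regions cut out of R^D by K hyperplanes in general position."""
    return sum(comb(K, i) for i in range(D + 1))


def regions_central_general(K: int, D: int) -> int:
    """Regions for K hyperplanes through a common point, otherwise generic."""
    return 2 * sum(comb(K - 1, i) for i in range(D))


if __name__ == "__main__":
    D = 4
    for K in (16, 64):
        print(K, regions_general_position(K, D),
              regions_central_general(K, D), 2 ** K)
```

For fixed $D$ both counts are polynomials in $K$ of degree at most $D$, so the fraction of the $2^{K}$ candidate sign patterns that are feasible vanishes quickly as the width grows.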

Algorithm 1 Proposed region enumeration method for the single hidden layer case that recursively searches over the feasible sign patterns $\bm{q}$ one unit at a time, and only explores the branches that correspond to non-empty regions, i.e. avoiding the exhaustive search over the $2^{K}$ possible combinations. The step checking for intersection between a hyperplane and a given polytopal region can be done easily by setting up a linear program with a dummy constant objective, the hyperplane as a linear equality constraint, and the polytopal region as inequality constraints; the feasibility check fails if the intersection is empty. This algorithm is used to produce the results of Fig. 1 and Table 1. The algorithm terminates once all the regions of the partition $\Omega$ have been visited.
1: $\bm{W} \in \mathbb{R}^{K \times D}, \bm{b} \in \mathbb{R}^{K}, k \in \{1, \dots, K\}, \bm{q} \in \{-1, 0, 1\}^{k}$
2: if $k \overset{?}{=} K+1$ then this branch has reached a leaf, the sign pattern $\bm{q}$ is feasible and can be accumulated into $\Omega$’s current estimate
3: Check if the hyperplane defined by $(\bm{w}_{k}, \bm{b}_{k})$ intersects the polytopal region defined by $\bigcap_{j=1}^{k-1} \{\bm{x} \in \mathbb{R}^{D} : (\langle \bm{w}_{j}, \bm{x} \rangle + \bm{b}_{j}) \bm{q}_{j} \geq 0\}$
4: if NO then unit $k$ is redundant, call the routine again with $[\bm{q}, 0]$ as $\bm{q}$ and $k+1$ as $k$
5: if YES then unit $k$ splits the region into two, call the routine again with $[\bm{q}, 1]$ and $k+1$, and with $[\bm{q}, -1]$ and $k+1$
6: $\bm{X}^{(L)}$ $\triangleright$ Evaluate loss and back-propagate as usual
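A minimal Python sketch of the recursion in Algorithm 1 is given below. It is not the authors’ implementation: it assumes NumPy and SciPy are available, uses `scipy.optimize.linprog` with a zero objective as the feasibility check, indexes units from 0, runs sequentially rather than in parallel, and assumes the hyperplanes are in general position so that testing the closed region is enough. The names `hyperplane_splits_region` and `enumerate_regions` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def hyperplane_splits_region(W, b, q, k):
    """Feasibility LP: does hyperplane k meet the region fixed by signs q[:k]?"""
    D = W.shape[1]
    active = [j for j in range(k) if q[j] != 0]  # skipped units add no constraint
    # q_j (w_j . x + b_j) >= 0   <=>   -(q_j w_j) . x <= q_j b_j
    A_ub = np.array([-q[j] * W[j] for j in active]) if active else None
    b_ub = np.array([q[j] * b[j] for j in active]) if active else None
    res = linprog(np.zeros(D), A_ub=A_ub, b_ub=b_ub,
                  A_eq=W[k:k + 1], b_eq=-b[k:k + 1],
                  bounds=[(None, None)] * D)  # variables are unbounded
    return res.status == 0  # status 0: a feasible point was found


def enumerate_regions(W, b, q=()):
    """Return the feasible sign patterns; a 0 marks a unit that was skipped."""
    k = len(q)
    if k == W.shape[0]:  # leaf: every unit has been processed
        return [q]
    if hyperplane_splits_region(W, b, q, k):
        return (enumerate_regions(W, b, q + (1,))
                + enumerate_regions(W, b, q + (-1,)))
    return enumerate_regions(W, b, q + (0,))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b = rng.standard_normal((8, 2)), rng.standard_normal(8)
    print(len(enumerate_regions(W, b)), "regions found")
```

Each leaf corresponds to exactly one region, so the number of returned patterns is the number of regions of the arrangement. The two recursive calls in the splitting branch are independent of each other, which is where parallel execution could be introduced.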
Table 1: Comparison of our exact enumeration method versus sampling-based partition discovery for various single layer configurations with random weights and biases. The sampling-based discovery is run $5$ times and we report the average and standard deviation of the number of regions found. The number of input space samples is chosen so that the computation time of the sampling method matches that of the proposed method, i.e. for each configuration, both methods run for the exact same amount of time. We observe that for low-dimensional input spaces, and with the same fixed time budget, both methods perform similarly and sampling is sufficient to quickly discover nearly all of the layer’s partition regions.
| input dim | method | K=16 | K=32 | K=64 | K=128 | K=256 |
|---|---|---|---|---|---|---|
| D=2 | enumeration | 16 | 13 | 71 | 127 | 631 |
| D=2 | sampling | 16 ± 0 | 13 ± 0 | 67 ± 0 | 127 ± 0 | 611 ± 2 |
| D=2 | samp. found | 100% | 100% | 94% | 100% | 96% |
| D=4 | enumeration | 54 | 80 | 1107 | 4271 | 95954 |
| D=4 | sampling | 51 ± 0 | 69 ± 1 | 866 ± 3 | 3288 ± 18 | 70635 ± 55 |
| D=4 | samp. found | 94% | 86% | 78% | 77% | 73% |
| D=8 | enumeration | 24 | 1242 | 8396 | 386566 | - |
| D=8 | sampling | 18 ± 0 | 543 ± 2 | 2875 ± 5 | 136748 ± 251 | - |
| D=8 | samp. found | 75% | 44% | 34% | 35% | - |
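For reference, the sampling-based baseline compared against in Table 1 can be sketched as follows. The sampling domain is an assumption of this sketch (a cube $[-r, r]^{D}$ with a hypothetical `radius` parameter), since the exact sampling setup is not specified here; the function name `sample_sign_patterns` is likewise hypothetical.

```python
import numpy as np


def sample_sign_patterns(W, b, n_samples, radius=1.0, seed=0):
    """Distinct sign patterns reached by uniform samples of [-radius, radius]^D."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-radius, radius, size=(n_samples, W.shape[1]))
    Q = np.where(X @ W.T + b > 0, 1, -1)  # one sign pattern per sample
    return np.unique(Q, axis=0)           # keep each discovered region once


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W, b = rng.standard_normal((32, 4)), rng.standard_normal(32)
    for n in (10**2, 10**3, 10**4):
        print(n, len(sample_sign_patterns(W, b, n)))
```

Such a procedure can only report regions that intersect the sampled domain and that receive at least one sample, which is why small regions tend to be missed.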
Fig. 2: Depiction of the multilayer case, which corresponds to a union of region-constrained hyperplane arrangements and can thus be studied directly from the proposed hyperplane arrangement region enumeration. The only additional step is to first enforce that the search takes place on the restricted region of interest from the up-to-layer-$\ell$ input space partition. For example, in the left column one first obtains the first layer partition, depicted in black. Each of the enumerated regions is then subdivided by the next layer: pick any region of interest, compose the per-region affine mapping (fixed on that region) with the second layer affine mappings, and repeat the region enumeration algorithm. This discovers the subdivision done by the second layer, highlighted in blue in the middle column. This can be repeated to obtain the subdivision of the third layer, here highlighted in red in the right column.

3 Enumeration of Multi-Layer Partitions

This section demonstrates how the derivation carried out in Section 2 for the single layer setting is sufficient to enumerate the partition of a multilayer DN, thanks to the subdivision process under which the composition of many layers ultimately forms the global DN’s input space partition. We first recall this subdivision step in Section 3.1 and summarize the enumeration algorithm in Section 3.2.

3.1 Deep Networks are Continuous Piecewise Affine

We specialize the per-layer notations from Section 2 by making the layer index $\ell$ explicit, writing $f^{(\ell)}$ for the layer mapping and $\bm{\theta}^{(\ell)}$ for its parameters; the entire DN’s input-output mapping is now referred to as $f_{\bm{\theta}}: \mathbb{R}^{D} \mapsto \mathbb{R}^{K}$, with $K$ the output space dimension. The composition of layers takes the form

$$f_{\bm{\theta}} = \left(f_{\bm{\theta}^{(L)}}^{(L)} \circ \dots \circ f_{\bm{\theta}^{(1)}}^{(1)}\right), \tag{4}$$

where each layer mapping $f^{(\ell)}: \mathbb{R}^{D^{(\ell)}} \mapsto \mathbb{R}^{D^{(\ell+1)}}$ produces a feature map, with $D^{(1)} \triangleq D$ and $D^{(L+1)} \triangleq K$, with mapping given by Eq. 1, and with $\bm{h}^{(\ell)}$ denoting the pre-activation map of layer $\ell$. A key result from [20, 7] is that the DN mapping is itself defined on a partition, as in

$$f_{\bm{\theta}}(\bm{x}) = \sum_{\omega \in \Omega} (\bm{A}_{\omega}\bm{x} + \bm{b}_{\omega}) 1_{\{\bm{x} \in \omega\}},$$

which is known to be recursively built by each layer subdividing the previously built partition of the space [21].
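The per-region affine parameters of a multilayer DN can be obtained by composing the per-layer affine maps, as in the following minimal sketch. It assumes leaky-ReLU activations with slope `alpha = 0.1` applied after every layer (as in Eq. 1), random Gaussian parameters, and hypothetical layer widths; the function names `forward` and `region_affine_params` are illustrative only.

```python
import numpy as np


def leaky_relu(h, alpha=0.1):
    return np.where(h > 0, h, alpha * h)


def forward(params, x, alpha=0.1):
    """Evaluate the DN as a composition of layers of the form of Eq. 1."""
    for W, b in params:
        x = leaky_relu(W @ x + b, alpha)
    return x


def region_affine_params(params, x, alpha=0.1):
    """(A_w, b_w) of the region containing x, composed layer by layer."""
    A, c, z = np.eye(x.shape[0]), np.zeros(x.shape[0]), x
    for W, b in params:
        h = W @ z + b
        slopes = np.where(h > 0, 1.0, alpha)  # fixed within the region
        A = slopes[:, None] * (W @ A)         # compose the linear parts
        c = slopes * (W @ c + b)              # compose the offsets
        z = slopes * h                        # layer output at x
    return A, c


rng = np.random.default_rng(0)
dims = [3, 5, 4, 2]  # hypothetical widths D^(1), ..., D^(L+1)
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(dims[:-1], dims[1:])]
x = rng.standard_normal(dims[0])
A_w, b_w = region_affine_params(params, x)
print(np.allclose(forward(params, x), A_w @ x + b_w))
```

The pair `(A_w, b_w)` depends on `x` only through the sign patterns of all layers, which is what makes the DN affine on each region of the partition.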

3.2 Enumerating Union of Hyperplane Arrangements

Considering an arbitrarily deep model can be tackled by understanding the recurrent subdivision process of a two hidden layer DN and applying the same principle successively. In this setting, notice that for the (two-layer) DN to be affine within some region $\omega$ of the DN’s input space, each layer must stay affine as well. By composition, the first layer staying affine does not ensure that the DN stays affine, but the first layer being nonlinear does imply that the entire DN is nonlinear. From that, we see that the first layer’s partition regions are “coarser” than the entire DN’s partition regions. More precisely, and following the derivation of [21], we obtain that each layer performs a recursive subdivision of the previously built partition; in our case, we need to search, for each region $\omega$ of the first layer’s partition, for the regions within it where the second layer stays affine. As a result, the proposed single hidden layer enumeration method from Section 2 can be applied recursively as follows. First, enumerate the first layer’s partition.
Then, for each enumerated region with corresponding sign pattern $\bm{q}$, define a new single layer model $\sigma(\bm{h}(\bm{x}))$ with pre-activation map $\bm{h}(\bm{x}) \triangleq \bm{W}^{(2)} \operatorname{diag}(\bm{q}) \bm{W}^{(1)} \bm{x} + \bm{W}^{(2)} (\bm{q} \odot \bm{b}^{(1)}) + \bm{b}^{(2)}$, and apply the single layer enumeration within $\omega$, repeating the process for all regions (and corresponding sign patterns $\bm{q}$) of the previously found first layer partition. This enumerates the partition of $(f^{(2)} \circ f^{(1)})$, and the same process can be repeated as many times as there are layers in the DN, as illustrated in Fig. 2.
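As a sanity check of the composed pre-activation above, the sketch below assumes the absolute value activation, for which $\sigma(\bm{h}) = \bm{q} \odot \bm{h}$ with $\bm{q} = \operatorname{sign}(\bm{h})$, so that the formula applies verbatim with $\bm{q} \in \{-1, 1\}^{K}$. For (leaky-)ReLU we would expect the entries of $\bm{q}$ to be replaced by the corresponding activation slopes; this is our reading rather than a statement from the text. The function name `restricted_second_layer` and the dimensions are hypothetical.

```python
import numpy as np


def restricted_second_layer(W1, b1, W2, b2, q):
    """Second-layer hyperplane parameters, as a function of the DN input,
    on the first-layer region with sign pattern q (absolute value activation)."""
    W = W2 @ np.diag(q) @ W1
    b = W2 @ (q * b1) + b2
    return W, b


rng = np.random.default_rng(0)
D, K1, K2 = 3, 5, 4
W1, b1 = rng.standard_normal((K1, D)), rng.standard_normal(K1)
W2, b2 = rng.standard_normal((K2, K1)), rng.standard_normal(K2)

x = rng.standard_normal(D)
q = np.sign(W1 @ x + b1)                # first-layer sign pattern at x
W, b = restricted_second_layer(W1, b1, W2, b2, q)
direct = W2 @ np.abs(W1 @ x + b1) + b2  # second-layer pre-activation at x
print(np.allclose(W @ x + b, direct))
```

The rows of `W` and entries of `b` define the hyperplanes that subdivide the first-layer region, so the single layer enumeration can be run on them with the constraints of that region added.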

4 Conclusion and Future Work

In this paper, we provided the first exact enumeration method for Deep Network partitions, which relies on existing highly efficient enumeration methods for hyperplane arrangements. In fact, both shallow and deep architectures produce partitions that correspond to hyperplane arrangements or unions of restricted hyperplane arrangements. A crucial finding enabled by the proposed method is that sampling-based region enumeration, which is the only strategy used in current research studies dealing with DNs and affine splines, is in fact relatively poor at finding the regions of the DN’s partition. In particular, when using such sampling to estimate some sensitive statistics, e.g. the volume of the smallest region, sampling is biased and should be avoided in favor of an exact enumeration method.

References

  • [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [2] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro, “In search of the real inductive bias: On the role of implicit regularization in deep learning,” arXiv preprint arXiv:1412.6614, 2014.
  • [3] Yann Le Cun, Ido Kanter, and Sara A Solla, “Eigenvalues of covariance matrices: Application to neural-network learning,” Physical Review Letters, vol. 66, no. 18, pp. 2396, 1991.
  • [4] Arthur Jacot, Franck Gabriel, and Clément Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” arXiv preprint arXiv:1806.07572, 2018.
  • [5] Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao, “Why do deep residual networks generalize better than deep feedforward networks?—a neural tangent kernel perspective,” Advances in neural information processing systems, vol. 33, pp. 2698–2709, 2020.
  • [6] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro, “The implicit bias of gradient descent on separable data,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 2822–2878, 2018.
  • [7] Randall Balestriero and Richard Baraniuk, “A spline theory of deep learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 374–383.
  • [8] Rudolf H Riedi, Randall Balestriero, and Richard G Baraniuk, “Singular value perturbation and deep network optimization,” arXiv preprint arXiv:2203.03099, 2022.
  • [9] Randall Balestriero and Richard G Baraniuk, “From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference,” arXiv preprint arXiv:1810.09274, 2018.
  • [10] Randall Balestriero, Sébastien Paris, and Richard Baraniuk, “Analytical probability distributions and exact expectation-maximization for deep generative networks,” Advances in neural information processing systems, vol. 33, pp. 14938–14949, 2020.
  • [11] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, “Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10641–10650.
  • [12] Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova, “Nonlinear approximation and (deep) relu networks,” Constructive Approximation, vol. 55, no. 1, pp. 127–172, 2022.
  • [13] Randall Balestriero and Richard G Baraniuk, “Batch normalization explained,” arXiv preprint arXiv:2209.14778, 2022.
  • [14] Lily Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane Boning, and Inderjit Dhillon, “Towards fast computation of certified robustness for relu networks,” in International Conference on Machine Learning. PMLR, 2018, pp. 5276–5285.
  • [15] Aditi Raghunathan, Jacob Steinhardt, and Percy S Liang, “Semidefinite relaxations for certifying robustness to adversarial examples,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [16] Herbert Edelsbrunner, Algorithms in combinatorial geometry, vol. 10, Springer Science & Business Media, 1987.
  • [17] Wuyang Chen, Xinyu Gong, and Zhangyang Wang, “Neural architecture search on imagenet in four gpu hours: A theoretically inspired perspective,” arXiv preprint arXiv:2102.11535, 2021.
  • [18] Richard E Bellman and Stuart E Dreyfus, Applied dynamic programming, vol. 2050, Princeton university press, 2015.
  • [19] Mario Köppen, “The curse of dimensionality,” in 5th online world conference on soft computing in industrial applications (WSC5), 2000, vol. 1, pp. 4–8.
  • [20] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio, “On the number of linear regions of deep neural networks,” Advances in neural information processing systems, vol. 27, 2014.
  • [21] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk, “The geometry of deep networks: Power diagram subdivision,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [22] David Avis and Komei Fukuda, “Reverse search for enumeration,” Discrete applied mathematics, vol. 65, no. 1-3, pp. 21–46, 1996.
  • [23] Richard P Stanley et al., “An introduction to hyperplane arrangements,” Geometric combinatorics, vol. 13, no. 389-496, pp. 24, 2004.