Fully invertible hyperbolic neural networks for segmenting large-scale surface and sub-surface data

Bas Peters
Computational Geosciences Inc.
Vancouver, BC
[email protected]
Eldad Haber
Department of Earth, Ocean, and Atmospheric Sciences
The University of British Columbia
Vancouver, BC
Keegan Lensink
Department of Earth, Ocean, and Atmospheric Sciences
The University of British Columbia
Vancouver, BC
Abstract

The large spatial/temporal/frequency scale of geoscience and remote-sensing datasets causes memory issues when using convolutional neural networks for (sub-) surface data segmentation. Recently developed fully reversible or fully invertible networks can mostly avoid memory limitations by recomputing the states during the backward pass through the network. This results in a low and fixed memory requirement for storing network states, as opposed to the typical linear memory growth with network depth. This work focuses on a fully invertible network based on the telegraph equation. While reversibility removes most of the memory that deep networks spend on storing the data (network states), the convolutional kernels can become the dominant memory consumer if fully invertible networks contain multiple invertible pooling/coarsening layers. We address the explosion of the number of convolutional kernels by combining fully invertible networks with layers that store the convolutional kernels directly in a compressed form. A second challenge is that an invertible network outputs a tensor of the same size as its input. This property prevents the straightforward application of invertible networks to tasks that map between different input-output dimensions, need to map to outputs with more channels than present in the input data, or require outputs with a lower or higher resolution than the input data. However, we show that by employing invertible networks in a non-standard fashion, we can still use them for these tasks. Examples in hyperspectral land-use classification, airborne geophysical surveying, and seismic imaging illustrate that we can input large data volumes in one chunk and do not need to work on small patches, use dimensionality reduction, or employ methods that classify a patch to a single central pixel.

Keywords Invertible Neural Networks · Large Scale Deep Learning · Memory Efficient Deep Learning

1 Introduction

Many datasets in the imaging sciences are intrinsically 3D or 4D. Examples include the interpretation of seismic imagery, hyperspectral data segmentation for land-use classification, and the segmentation of various medical imagery. To construct convolutional neural networks with a sufficiently large field-of-view (receptive field), we need deeper networks with more layers or multiple coarsening (pooling) and refinement stages in the network. Reasons we wish to work on large chunks of data instead of many small patches include wanting to learn from larger length scales that may be present in the data, as well as weakly supervised approaches that add prior knowledge or constraints related to properties of full images (Kervadec et al., 2019; Peters, 2022; Jia et al., 2021).

The dominant factor that limits the input data size and network depth is the storage of the network state, that is, the convolved data at each layer that is needed in order to compute a gradient of the loss function using back-propagation, implemented via reverse-mode automatic differentiation. Re-computing the network (forward) states in reverse order during back-propagation avoids this problem. This re-computation is possible when using fully invertible (also known as reversible) networks. Fully invertible networks have a constant memory requirement for states (activations) that is independent of network depth and the number of pooling stages, see Figure 2. Therefore, fully invertible networks largely avoid the memory limitations related to storing states.

Specifically, we employ an invertible network based on a second-order hyperbolic differential equation (Lensink et al., 2022). The connection with the wave equation enables the use of a suite of tools, ranging from analysis to interpretation, that are commonly used in mathematical physics and numerical analysis. We note that other invertible network constructions exist, including invertible Hamiltonians (Ruthotto and Haber, 2018), invertible ResNets (Behrmann et al., 2019), and invertible u-nets (Etmann et al., 2020). The fully invertible networks generally extend invertible networks for image classification (Jacobsen et al., 2018; van de Leemput et al., 2019) and networks that are only invertible in between coarsening/pooling stages (Ruthotto and Haber, 2018; Chang et al., 2018; Gomez et al., 2017; Dinh et al., 2016).

Fully invertible networks use invertible pooling/coarsening operations. Examples are the Haar transform (Lensink et al., 2022), reordering via a checkerboard pattern (Dinh et al., 2016; Jacobsen et al., 2018), or various learned coarsening operators (Lensink et al., 2022; Etmann et al., 2020). The invertible pooling causes the fully hyperbolic invertible network to be less flexible than some other networks for image segmentation in two ways. First, the convolutional kernels become the dominant memory consumer when the network contains several down-sampling operators. A fully invertible hyperbolic network needs to increase the number of channels by a factor of eight to change the resolution by a factor of two in each direction in 3D. This preservation of the number of elements in the tensors makes the coarsening and channel-count changes invertible operations. However, this approach leads to an ‘explosion’ of the channels, as remarked by Peters et al. (2019a); Etmann et al. (2020). For instance, if the input is three-channel RGB, there are $192$ channels after two coarsening layers and an astonishing $98304$ channels in case we wish to coarsen five times. The storage and computations of the associated $98304^{2}$ convolutional kernels (just for one layer at the coarsest level) would be completely infeasible. Figure 2 illustrates this effect.
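The channel counts quoted above follow from multiplying the channel count by eight at each 3D coarsening. The short sketch below reproduces that arithmetic and estimates the kernel storage for a single layer. The function names are hypothetical, and the $3\times 3\times 3$ kernel size and 4-byte (single-precision) values are assumptions made only for this illustration.

```python
def channels_after_coarsening(n_chan_in: int, n_coarsen: int, n_dim: int = 3) -> int:
    """Channels after n_coarsen invertible coarsenings by a factor of two per direction."""
    # Each coarsening moves 2**n_dim spatial samples into the channel dimension.
    return n_chan_in * (2 ** n_dim) ** n_coarsen


def kernel_memory_gb(n_chan: int, kernel_elems: int = 27, bytes_per_value: int = 4) -> float:
    """Storage in GB (1e9 bytes) for one layer with n_chan x n_chan kernels.

    Assumes 3x3x3 kernels (27 values) stored in single precision.
    """
    return n_chan ** 2 * kernel_elems * bytes_per_value / 1e9


if __name__ == "__main__":
    for n_coarsen in (0, 1, 2, 5):
        n_chan = channels_after_coarsening(3, n_coarsen)
        print(n_coarsen, n_chan, kernel_memory_gb(n_chan))
```

Under these assumptions, a single layer at the coarsest of five levels would already need on the order of a terabyte for its kernels, which is why the kernels rather than the states become the bottleneck.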

A second way in which the fully invertible hyperbolic network is a relatively ‘rigid’ design is that an orthogonal transform can only increase and later decrease the number of channels by $8\times$ per coarsening layer; we cannot arbitrarily reduce the number of channels and thus the number of network parameters. Similarly, one cannot increase the number of channels or make the network wider without decreasing resolution. This will often cause the network to contain too many or too few convolutional kernels for a certain task, given the network depth and the number of coarsening/refinement stages. The standard design of an invertible network outputs a tensor of the same size as its input and with the same number of channels. Therefore, one cannot directly apply invertible networks to applications like hyperspectral imaging that map 3D/4D inputs to a 2D output. Other applications that cannot directly work with invertible networks include ones that need to map to outputs with more channels than present in the input data or require outputs with a lower or higher resolution than the input data.

In this work, we present solutions to the above problems by combining the design of fully invertible networks with layers that can reduce the storage and computations related to the convolutional kernels. The same layer also serves as a way to increase the number of convolutional kernels per layer if required, without changing resolution. These two features make fully invertible hyperbolic networks much more flexible and fix the primary disadvantages. Furthermore, we show that we can, in fact, use invertible neural networks to change the resolution or the number of output channels while maintaining network invertibility.

Several examples illustrate how the presented tools enable the application of invertible neural networks to the following geoscientific problems: 1) time-lapse hyperspectral land-use-change detection, which maps 4D data to a 2D map; 2) large-scale 2D multi-modal airborne-geophysical and remote sensing for aquifer mapping, where we map from dozens of input channels to a couple of output classes; 3) geological model building from seismic data, where we set up the network so that it outputs a lower resolution compared to the input.

The examples illustrate that the developed tools extend the type of problems that can be handled using fully hyperbolic architectures, enabling training using larger data blocks as input for the network. This, in turn, allows us to work with higher-resolution inputs and learn larger-scale (spatial/temporal/harmonic) patterns that are present in the data.

1.1 Contributions

This work looks at some practical obstacles when applying fully invertible neural networks based on hyperbolic PDEs to large-scale remote sensing and geoscience problems. Specifically, we note our primary contributions as:

  • To the best of our knowledge, this is the first work that addresses the issue of the ‘exploding’ memory for convolutional kernels with an increasing number of resolution/pooling changes in a hyperbolic invertible network. Our solution keeps the network fully invertible while drastically reducing the number of convolutional kernels and associated memory and thus enables learning from data on a much larger scale than before while also working with arbitrarily deep networks.

  • We present a few subtle modifications that remove the limitation that fully invertible hyperbolic networks map between inputs and outputs of the same size/resolution and channel count while not giving up full invertibility and without increasing the computational cost or memory requirements.

After reviewing the design and some properties of the invertible hyperbolic neural network, we illustrate the limitations of the network structure. Then, we propose our solutions. Finally, we train on hyperspectral, multi-modality, and seismic datasets with sparse spatial label sampling.

2 Fully invertible hyperbolic neural networks for large input-output problems

The overwhelming majority of the literature relies on reverse-mode automatic differentiation for gradient computation. This type of gradient computation requires access to the network states $\mathbf{Y}_{j}$ at layer $j$ during the backpropagation phase. Standard implementations keep all network states in memory, causing the memory footprint to grow linearly with network depth. Workarounds to reduce memory may rely on shallow networks or a network design that maps a small patch or data sub-volume into the class of the central pixel/voxel.

Fully invertible networks based on PDEs require memory for just a couple of layer states, depending on the discretization. So, there is no longer a need to trade off network depth for input size. Memory savings by using invertible architectures allow us to allocate all available memory towards larger data input volumes, enabling the network to learn from large-scale structures in the data.
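The difference between storing all states and recomputing them can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper's implementation: the function name is hypothetical, the single-channel $300^3$ input and the 50 layers are example values, single-precision storage is assumed, and three resident states are assumed for the invertible case (the two previous states of the recurrence plus the one being computed). It ignores kernels, gradients, and any workspace memory.

```python
def state_memory_gb(n_elements: int, n_states: int, bytes_per_value: int = 4) -> float:
    """Memory in GB (1e9 bytes) needed to hold n_states tensors of n_elements values each."""
    return n_elements * n_states * bytes_per_value / 1e9


# Example values (assumptions for illustration only).
n_elements = 300 ** 3   # one-channel 3D input; invertible coarsening keeps this count fixed
n_layers = 50

# Non-invertible network: every layer state is kept for back-propagation.
stored_all = state_memory_gb(n_elements, n_layers)

# Invertible network: states are recomputed, so only a fixed few are resident.
recomputed = state_memory_gb(n_elements, 3)

if __name__ == "__main__":
    print(f"all states stored: {stored_all:.3f} GB")
    print(f"states recomputed: {recomputed:.3f} GB")
```

The first number grows linearly with `n_layers`, while the second does not depend on depth at all, so any memory that is freed can be spent on a larger input volume.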

As discussed in the introduction, out of the various fully invertible network designs, our focus is on the physics-inspired invertible network based on the non-linear Telegraph equation (Zhou and Luo, 2018) with time-step $h$. This equation, which is the basis for the invertible architecture of Ruthotto and Haber (2018) and Chang et al. (2018), reads

$$\frac{\partial^{2}\mathbf{Y}}{\partial t^{2}} = f(\mathbf{Y}, \boldsymbol{\theta}(t)), \tag{2.1}$$

where $\mathbf{Y}$ is the state, $f$ is a non-linear function, and the model parameters $\boldsymbol{\theta}(t)$ are time dependent. The model parameters often parameterize convolutional, block-convolutional, differential and dense matrices, denoted as $\mathbf{K}(\boldsymbol{\theta}(t))$. While many neural networks use nonlinearities of the type $f(\mathbf{Y}(t), \mathbf{K}(\boldsymbol{\theta}(t))) = \sigma(\mathbf{Y}(t), \mathbf{K}(\boldsymbol{\theta}(t)))$ with a nonlinear, point-wise, and monotonically increasing activation function $\sigma: \mathbb{R} \rightarrow \mathbb{R}$, we select the symmetric layer (Ruthotto and Haber, 2018) of the form

$$f(\mathbf{Y}, \boldsymbol{\theta}(t)) = -\mathbf{K}(\boldsymbol{\theta}(t))^{\top} \sigma(\mathbf{K}(\boldsymbol{\theta}(t)) \mathbf{Y}(t)). \tag{2.2}$$

With this choice, (2.1) becomes

$$\frac{\partial^{2}\mathbf{Y}}{\partial t^{2}} = -\mathbf{K}(\boldsymbol{\theta}(t))^{\top} \sigma(\mathbf{K}(\boldsymbol{\theta}(t)) \mathbf{Y}(t)). \tag{2.3}$$

The motivation for the specific symmetric layer choice relates to stability and energy conservation of the forward propagation through the network; see Ruthotto and Haber (2018) for a stability proof.

To proceed, we follow Ruthotto and Haber (2018) and use the conservative Leapfrog discretization of the second derivative of the state,

$$\frac{\partial^{2}\mathbf{Y}}{\partial t^{2}} \approx \frac{1}{h^{2}} \left( \mathbf{Y}_{j+1} - 2\mathbf{Y}_{j} + \mathbf{Y}_{j-1} \right), \tag{2.4}$$

where $h$ now indicates the artificial-time step and $j$ indexes the discrete time. Combining the discretization of the second derivative and (2.3) results in

$$\begin{aligned}
\mathbf{Y}_{1} &= \mathbf{X}, \quad \mathbf{Y}_{2} = \mathbf{X} \\
\mathbf{Y}_{j} &= 2\mathbf{Y}_{j-1} - \mathbf{Y}_{j-2} - h^{2} \mathbf{K}(\boldsymbol{\theta}_{j})^{\top} \sigma(\mathbf{K}(\boldsymbol{\theta}_{j}) \mathbf{Y}_{j-1})
\end{aligned} \tag{2.5}$$

Equation (2.5) is a hyperbolic network that uses a single resolution. The first two states are the initial conditions, which we set equal to the input data $\mathbf{X} \in \mathbb{R}^{n_{1} \times n_{2} \times n_{3} \times n_{\text{chan}}}$. The data tensor has $n_{\text{chan}}$ channels and three other dimensions of sizes $n_{1}$, $n_{2}$, and $n_{3}$. Examples include hyperspectral data and 3D seismic image volumes.

The artificial time-step $h$ affects the stability of the forward propagation (Haber and Ruthotto, 2017) via the well-known CFL condition (LeVeque, 1990). In this work, we set the linear operator $\mathbf{K}(\boldsymbol{\theta}_{j})$ to convolutions with kernels $\boldsymbol{\theta}_{j}$, and select the ReLU as the activation function $\sigma(\cdot)$ for the examples.
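The single-resolution recurrence (2.5) can be sketched in a few lines of NumPy. This is an illustration rather than the implementation used for the experiments: to keep it short, each $\mathbf{K}(\boldsymbol{\theta}_j)$ is a small random dense matrix instead of a convolution, the state is a flat vector, and the names `leapfrog_forward`, `kernels`, and the chosen sizes are hypothetical.

```python
import numpy as np


def relu(z):
    """Point-wise ReLU activation."""
    return np.maximum(z, 0.0)


def leapfrog_forward(x, kernels, h):
    """Propagate x through the recurrence (2.5) and return every state Y_1, Y_2, ...

    All states are returned only so they can be inspected; an invertible
    implementation would keep just the most recent ones.
    """
    states = [x, x]  # initial conditions Y_1 = Y_2 = X
    for K in kernels:  # one operator K per layer j = 3, 4, ...
        y_prev, y_prev2 = states[-1], states[-2]
        y_new = 2.0 * y_prev - y_prev2 - h**2 * K.T @ relu(K @ y_prev)
        states.append(y_new)
    return states


rng = np.random.default_rng(0)
x = rng.standard_normal(16)
# Dense stand-ins for the convolution operators (assumption for illustration).
kernels = [rng.standard_normal((16, 16)) / 4.0 for _ in range(10)]
states = leapfrog_forward(x, kernels, h=0.1)
```

Because both initial states equal the input, the first computed state reduces to $\mathbf{X} - h^{2}\mathbf{K}^{\top}\sigma(\mathbf{K}\mathbf{X})$, which is a convenient sanity check for an implementation.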

In order to introduce multi-resolution into the system we use the approach by Lensink et al. (2022) and introduce the linear operators $\mathbf{W}_{j}$ that change the resolution of the state without losing information by moving the information from the spatial dimension to the channel dimension, obtaining the network

$$\begin{aligned}
\mathbf{Y}_{1} &= \mathbf{X}, \quad \mathbf{Y}_{2} = \mathbf{X} \\
\mathbf{Y}_{j} &= 2\mathbf{W}_{j-1}\mathbf{Y}_{j-1} - \mathbf{W}_{j-2}\mathbf{Y}_{j-2} - h^{2} \mathbf{K}(\boldsymbol{\theta}_{j})^{\top} \sigma(\mathbf{K}(\boldsymbol{\theta}_{j}) \mathbf{W}_{j-1}\mathbf{Y}_{j-1}), \quad j = 3, \cdots, n.
\end{aligned} \tag{2.6}$$

The operator $\mathbf{W}$ represents coarsening/pooling to change the resolution and the number of channels simultaneously. Note that $\mathbf{W}_{j}$ equals the identity if there is no resolution change at layer $j$. Figure 1 illustrates an instance of the network design.

The operators $\mathbf{W}_{j}$ are important components of networks for image-to-image mappings, and they need to be invertible operators to enable full invertibility of the network. Practical invertible linear operators are orthogonal, do not require storage of their dense matrix representation, and have fast forward and inverse transforms known in closed form. Here, we select the orthogonal Haar wavelet transform (Truchetet and Laligant, 2004) $\mathbf{W}$ to coarsen the image and increase the number of channels simultaneously. This choice was used successfully in Lensink et al. (2022). The transpose (and inverse) achieves the reverse of these operations, $\mathbf{W}^{\top} = \mathbf{W}^{-1}$ and $\mathbf{W}^{\top}\mathbf{W} = I = \mathbf{W}\mathbf{W}^{\top}$, so the action of the linear operator $\mathbf{W}$ on a tensor $\mathbf{Y}$ creates the mappings

\begin{align}
\mathbf{W}\mathbf{Y} &: \mathbb{R}^{n_{1} \times n_{2} \times n_{3} \times n_{\text{chan}}} \rightarrow \mathbb{R}^{n_{1}/2 \times n_{2}/2 \times n_{3}/2 \times 8 n_{\text{chan}}}, \tag{2.7} \\
\mathbf{W}^{-1}\mathbf{Y} &: \mathbb{R}^{n_{1} \times n_{2} \times n_{3} \times n_{\text{chan}}} \rightarrow \mathbb{R}^{2n_{1} \times 2n_{2} \times 2n_{3} \times n_{\text{chan}}/8}. \tag{2.8}
\end{align}

Because of the invertibility of any orthogonal transform, applying $\mathbf{W}$ and $\mathbf{W}^{\top}$ incurs no loss of information. As mentioned in the introduction, other invertible pooling or coarsening/refining operators are also available.
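The mappings (2.7) and (2.8) can be illustrated with a one-level orthonormal Haar transform applied along each of the three spatial axes in turn. The sketch below is one possible construction under stated assumptions, not the authors' implementation: the function names are hypothetical, the spatial sizes must be even, and the ordering of the resulting channels is an arbitrary choice made here.

```python
import numpy as np


def _haar_axis(y, axis):
    """One-level orthonormal Haar step along a spatial axis; doubles the channel count."""
    a = np.take(y, np.arange(0, y.shape[axis], 2), axis=axis)  # even samples
    b = np.take(y, np.arange(1, y.shape[axis], 2), axis=axis)  # odd samples
    low = (a + b) / np.sqrt(2.0)
    high = (a - b) / np.sqrt(2.0)
    return np.concatenate([low, high], axis=-1)


def _haar_axis_inv(z, axis):
    """Inverse of _haar_axis: halves the channel count and doubles the axis length."""
    c = z.shape[-1] // 2
    low, high = z[..., :c], z[..., c:]
    a = (low + high) / np.sqrt(2.0)
    b = (low - high) / np.sqrt(2.0)
    shape = list(a.shape)
    shape[axis] *= 2
    y = np.empty(shape, dtype=z.dtype)
    even = [slice(None)] * y.ndim
    odd = [slice(None)] * y.ndim
    even[axis] = slice(0, None, 2)
    odd[axis] = slice(1, None, 2)
    y[tuple(even)] = a
    y[tuple(odd)] = b
    return y


def haar_coarsen(y):
    """W: (n1, n2, n3, c) -> (n1/2, n2/2, n3/2, 8c)."""
    for axis in (0, 1, 2):
        y = _haar_axis(y, axis)
    return y


def haar_refine(z):
    """W^{-1} = W^T: (n1, n2, n3, c) -> (2 n1, 2 n2, 2 n3, c/8)."""
    for axis in (2, 1, 0):
        z = _haar_axis_inv(z, axis)
    return z


rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 6, 8, 3))
Z = haar_coarsen(Y)
Y_back = haar_refine(Z)
```

The number of elements is unchanged, the Euclidean norm is preserved, and `haar_refine` recovers the input up to round-off, which are the properties the network relies on.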

Figure 1: Diagram of the flow of the multi-level hyperbolic network in (2.6). Shown for 3D input data with one channel and two levels. The network contains seven layers, with pooling after layer four and unpooling after layer six.

Invertibility of the full network is exploited by isolating one of the states in (2.6) and reversing the indices to obtain an expression for the current state in terms of future states:

$$\mathbf{Y}_{j} = \mathbf{W}_{j}^{-1} \bigg[ 2\mathbf{W}_{j+1}\mathbf{Y}_{j+1} - h^{2} \mathbf{K}(\boldsymbol{\theta}_{j+2})^{\top} \sigma(\mathbf{K}(\boldsymbol{\theta}_{j+2}) \mathbf{W}_{j+1}\mathbf{Y}_{j+1}) - \mathbf{Y}_{j+2} \bigg]. \tag{2.9}$$

This equation does not require inverting the activation function $\sigma$. Instead, only the inversion of the orthogonal wavelet transform is required, which is known in closed form. When computing the gradient of the loss function using backpropagation, we recompute the states $\mathbf{Y}_{j}$ while propagating backwards through the network. The re-computation avoids the storage of all $\mathbf{Y}_{j}$ and leads to a fixed memory requirement for the states of three layers; see Table 1 for an overview.
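The reverse recurrence can be checked numerically. The sketch below propagates an input forward while keeping only the last two states, then reconstructs the earlier states from them by running the recurrence backwards. It is a simplified illustration, assuming a single resolution (every $\mathbf{W}_j$ is the identity), dense random matrices in place of convolutions, and hypothetical function names; a real implementation would also accumulate gradients during the backward sweep.

```python
import numpy as np


def relu(z):
    """Point-wise ReLU activation; note that it is never inverted below."""
    return np.maximum(z, 0.0)


def layer_term(K, y, h):
    """The term h^2 K^T sigma(K y) shared by the forward and reverse recurrences."""
    return h**2 * K.T @ relu(K @ y)


def forward_last_two(x, kernels, h):
    """Run the forward recurrence and keep only the two most recent states."""
    y_prev2, y_prev = x, x  # Y_1 = Y_2 = X
    for K in kernels:
        y_prev2, y_prev = y_prev, 2.0 * y_prev - y_prev2 - layer_term(K, y_prev, h)
    return y_prev2, y_prev


def recompute_first_two(y_prev2, y_prev, kernels, h):
    """Walk backwards from the last two states to the first two.

    Uses Y_{j-2} = 2 Y_{j-1} - h^2 K_j^T sigma(K_j Y_{j-1}) - Y_j, i.e. the
    reverse recurrence with W equal to the identity.
    """
    y_next, y_cur = y_prev, y_prev2
    for K in reversed(kernels):
        y_older = 2.0 * y_cur - layer_term(K, y_cur, h) - y_next
        y_next, y_cur = y_cur, y_older
    return y_cur, y_next  # reconstructed Y_1 and Y_2


rng = np.random.default_rng(1)
x = rng.standard_normal(32)
kernels = [rng.standard_normal((32, 32)) / 8.0 for _ in range(20)]

last_two = forward_last_two(x, kernels, h=0.1)
y1, y2 = recompute_first_two(*last_two, kernels, h=0.1)
```

Both reconstructed states agree with the input up to floating-point round-off, even though only two states were ever held in memory during the forward pass.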

Table 1: Memory requirements for the states $\mathbf{Y}_{j}$ and convolutional kernels $\boldsymbol{\theta}$ for fully invertible and non-invertible equivalent networks based on the networks in Table 2.

| Example | Network | States | Conv. kernels |
|---|---|---|---|
| Hyperspectral | non invertible | 22.5 GB | 0.02 GB |
| Hyperspectral | invertible | 3.7 GB | 0.02 GB |
| Hyperspectral | invertible + BLR layers | 3.7 GB | 0.003 GB |
| Aquifer mapping | non invertible | 21.02 GB | 32.19 GB |
| Aquifer mapping | invertible | 1.66 GB | 32.19 GB |
| Aquifer mapping | invertible + BLR layers | 1.66 GB | 0.12 GB |
| 3D seismic | non invertible | 21.96 GB | 41.16 GB |
| 3D seismic | invertible | 2.19 GB | 41.16 GB |
| 3D seismic | invertible + BLR layers | 2.19 GB | 0.23 GB |

Figure 2: Memory requirements (Gigabyte) for network states (activations) and $3\times 3\times 3$ convolutional kernels. Left: as a function of input size for a fixed $50$-layer network with two coarsening stages. Middle: as a function of an increasing number of layers but with fixed input size ($300^{3}$) and a fixed number of two coarsenings. Right: as a function of an increasing number of coarsening steps but with a fixed number of layers ($50$) and input size ($300^{3}$). Our proposed Block-Low-Rank (BLR) layers avoid an exploding number of convolutional kernels with increased coarsening in invertible networks.

The benefits of the hyperbolic network design range beyond just invertibility and memory savings. We point out that the network design described in this section can include instance/batch/layer normalization. However, we found it unnecessary for our examples. The following stability property from (Ruthotto and Haber, 2018, Thm. 2) can partially explain this observation.

Theorem 2.1 (Stability of the fully invertible hyperbolic network (2.3) and (2.6)).

The neural network satisfies the stability criterion

$$\|\mathbf{y}_{1}(t=T) - \mathbf{y}_{2}(t=T)\|_{2}^{2} \leq c \|\mathbf{y}_{1}(t=0) - \mathbf{y}_{2}(t=0)\|_{2}^{2} \tag{2.10}$$

where $\mathbf{y}_{1}$ and $\mathbf{y}_{2}$ are the states that originate from two different initial conditions, evaluated at time zero and after propagating to time $T$. The constant $c>0$ is independent of $t$.

The above theorem is stated in (Ruthotto and Haber, 2018, Thm. 2) for time-independent weights $\mathbf{K}$. Without stating the full proof, we illuminate the structure of the hyperbolic network, stability, and connection to other physical wave-equations. To start, rewrite (2.3) into a first-order system of equations by introducing the state $\mathbf{v}(t)$ and using $\mathbf{K}\mathbf{u}(t) = \partial_{t}\mathbf{v}(t)$ and $-\mathbf{K}^{\top}\mathbf{v}(t) = \partial_{t}\mathbf{u}(t)$ to arrive at

\begin{align}
\partial_{t} \begin{bmatrix} \mathbf{u}(t) \\ \mathbf{v}(t) \end{bmatrix} &= \begin{bmatrix} I & 0 \\ 0 & \sigma(\cdot) \end{bmatrix} \begin{bmatrix} 0 & -\mathbf{K}^{\top} \\ \mathbf{K} & 0 \end{bmatrix} \begin{bmatrix} \mathbf{u}(t) \\ \mathbf{v}(t) \end{bmatrix} \tag{2.11} \\
\leftrightarrow \partial_{t}\mathbf{x}(t) &= \mathbf{B} \circ \mathbf{A}\mathbf{x}(t). \notag
\end{align}

The above immediately shows that the initial choice of network non-linearity $-\mathbf{K}(t)^{\top}\sigma(\mathbf{K}(t)\mathbf{Y}(t))$ leads to the anti-hermitian operator, or skew-symmetric matrix property $\mathbf{A}^{\top} = -\mathbf{A}$. This property, in combination with (2.11), leads to a standard stability argument via the upper bound of the energy

$$\|\mathbf{B} \circ \mathbf{A}\mathbf{x}(t)\|_{2}^{2} \leq \|\mathbf{A}\mathbf{x}(t)\|_{2}^{2} \leq \|\mathbf{A}\|_{2}^{2} \|\mathbf{x}(t)\|_{2}^{2} = c\|\mathbf{x}(t)\|_{2}^{2}, \tag{2.12}$$

where the nonlinearity satisfies $|\sigma(\mathbf{x})| \leq |\mathbf{x}|$ and $c$ is the constant $\|\mathbf{A}\|_2^2$. This energy definition is equivalent to the one in (Evans, 2010, sect. 2.4.3) and Ruthotto and Haber (2018). The upper bound is the energy associated with the linear wave-like equation. The energy of a linear wave-like equation is constant in time, as shown by a standard argument using the anti-hermitian $\mathbf{A}$ as follows:

\[
\begin{aligned}
\frac{\partial \|\mathbf{x}(t)\|_2^2}{\partial t}
&= \frac{\partial}{\partial t}(\mathbf{x}(t), \mathbf{x}(t)) = (\partial_t \mathbf{x}(t), \mathbf{x}(t)) + (\mathbf{x}(t), \partial_t \mathbf{x}(t)) \\
&= (\mathbf{A}\mathbf{x}(t), \mathbf{x}(t)) + (\mathbf{x}(t), \mathbf{A}\mathbf{x}(t)) \\
&= (\mathbf{A}\mathbf{x}(t), \mathbf{x}(t)) - (\mathbf{A}\mathbf{x}(t), \mathbf{x}(t)) = 0
\end{aligned}
\tag{2.13}
\]

where the norm is induced by the inner product $\|\mathbf{x}\|_2^2 = (\mathbf{x}, \mathbf{x})$.

Thus, the energy of the hyperbolic network, in between resolution changes, is bounded from above by the energy of the linear wave-like equation, which itself is bounded. The orthogonal wavelet transforms also do not change the energy across resolution levels.
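The two ingredients of this argument can be checked numerically on a small example. The following is only a sketch under stated assumptions: a dense random matrix stands in for the convolution matrix $\mathbf{K}$, the nonlinearity is taken to be a ReLU (which satisfies $|\sigma(\mathbf{x})| \leq |\mathbf{x}|$), and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
K = rng.standard_normal((n, n))  # assumption: dense stand-in for the convolution matrix K

# A = [[0, -K^T], [K, 0]] as in (2.11)
A = np.block([[np.zeros((n, n)), -K.T],
              [K, np.zeros((n, n))]])
x = rng.standard_normal(2 * n)

# Skew-symmetry: A^T + A should vanish.
skew_error = np.linalg.norm(A.T + A)

# Time derivative of ||x||^2 for the linear system d/dt x = A x, see (2.13).
energy_rate = 2.0 * x @ (A @ x)

# B applies sigma to the second block only; here sigma is a ReLU (assumption).
Ax = A @ x
BAx = np.concatenate([Ax[:n], np.maximum(Ax[n:], 0.0)])
lhs = np.linalg.norm(BAx) ** 2
rhs = np.linalg.norm(A, 2) ** 2 * np.linalg.norm(x) ** 2  # c ||x||^2 in (2.12)

print(skew_error, energy_rate, lhs <= rhs)
```

Up to floating-point round-off, `energy_rate` vanishes and `lhs` does not exceed `rhs`, in line with (2.13) and (2.12).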

Below, we will verify the stability property for randomly initialized and trained networks.

2.1 Empirical verification of stability properties

The inherited stability of the invertible hyperbolic network as stated above suggests it is possible to train the networks for the examples without using any instance/batch/layer normalization. Here, we illustrate the stability, as well as invertibility and energy growth, for both randomly initialized and trained networks. Figure 3 displays empirically observed network properties for an untrained version of the ($30$-layer, four-level) network described in Table 3, for a range of artificial time-step sizes $h$. The function $g(\mathbf{X}, \boldsymbol{\theta})$ denotes the propagation of input $\mathbf{X}$ through the network, and $g^{-1}$ denotes the inverse/reverse propagation. The figures show the following properties:

  • Energy growth: $\frac{\|g(\mathbf{X}_0, \boldsymbol{\theta})\|_2}{\|\mathbf{X}_0\|_2}$, which is the ratio of the norm of the network output to the norm of the network input.

  • Stability: $\frac{\|g(\mathbf{X}, \boldsymbol{\theta}) - g(\mathbf{X} + \delta\mathbf{X}, \boldsymbol{\theta})\|_2}{\|\delta\mathbf{X}\|_2}$, where $\delta\mathbf{X}$ is a random perturbation (selected with a magnitude of $0.1\|\mathbf{X}\|_2$ for the current experiment).

  • Invertibility error: $\frac{\|g^{-1}(g(\mathbf{X}, \boldsymbol{\theta}), \boldsymbol{\theta}) - \mathbf{X}\|_2}{\|\mathbf{X}\|_2}$
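The three quantities above can be computed for a toy network as follows. This is a minimal sketch, not the network of Table 3: it assumes dense random matrices in place of convolutions, a ReLU nonlinearity, no resolution changes, and the initialization $\mathbf{Y}_0 = \mathbf{Y}_1 = \mathbf{X}$; the function names `forward` and `inverse` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(X, Ks, h):
    """Leapfrog propagation; returns the last two states (both are needed to invert)."""
    Y_prev, Y = X, X  # assumption: Y_0 = Y_1 = X
    for K in Ks:
        Y_prev, Y = Y, 2.0 * Y - Y_prev - h**2 * K.T @ relu(K @ Y)
    return Y_prev, Y

def inverse(Y_prev, Y, Ks, h):
    """Reverse propagation: solve each leapfrog step for the older state."""
    for K in reversed(Ks):
        Y_prev, Y = 2.0 * Y_prev - Y - h**2 * K.T @ relu(K @ Y_prev), Y_prev
    return Y_prev

rng = np.random.default_rng(0)
n, n_layers, h = 64, 30, 0.1
Ks = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(n_layers)]

X = rng.standard_normal(n)
dX = rng.standard_normal(n)
dX *= 0.1 * np.linalg.norm(X) / np.linalg.norm(dX)  # perturbation of size 0.1 ||X||

_, out = forward(X, Ks, h)
_, out_pert = forward(X + dX, Ks, h)

energy_growth = np.linalg.norm(out) / np.linalg.norm(X)
stability = np.linalg.norm(out - out_pert) / np.linalg.norm(dX)
X_rec = inverse(*forward(X, Ks, h), Ks, h)
inv_error = np.linalg.norm(X_rec - X) / np.linalg.norm(X)

print(energy_growth, stability, inv_error)
```

Sweeping `h` over a range of values and recording the three numbers would give curves of the same kind as those discussed here, although the values for this toy setup are not comparable to the reported ones.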

The results show that a network initialized with zero mean and normally distributed weights is stable, preserves energy relatively well, and is invertible in practice. For energy growth and stability, a U-net with the same number of levels, layers per level, and the same number of channels resulted in exploding outputs. We note that the results depend on the scaling of the random initialization and that a smart initialization like Glorot and Bengio (2010) can also improve the stability at the start of training. It can be verified that our network shows similarly good results for a wide range of scalings of random initial weights.

The same test after training the network revealed the following numbers for each property: energy growth: $7.42$, which is a reasonably small number (note that energy growth is not expected to be smaller than one, see (2.12)); stability: mean of $1.05$ with std $0.004$; invertibility error: $0.00018$. These numbers illustrate that the favorable stability, energy growth, and invertibility properties are preserved while training the network.

Figure 3: Empirically observed properties of an untrained and randomly initialized invertible hyperbolic network, as a function of the ‘time-step’ $h$. The perturbation for the middle figure is also chosen randomly for every test point.

3 Problems and solutions for practical invertible networks

In the following, we propose solutions to the limitations of fully invertible multi-level hyperbolic networks as posed in the introduction: an exploding number of convolutional kernels, inability to deal with different input-output resolutions and channel numbers, and transformations between different data-label dimensions. Below, we address each of these problems and present solutions such that we can apply hyperbolic invertible networks anyway, without fundamentally altering the network design or giving up invertibility.

3.1 Exponential parameter growth with the number of resolution changes - a block low-rank approach

Orthogonal transforms that make neural networks fully invertible also preserve the number of entries after pooling/coarsening and unpooling/refining the state. Indeed, equations (2.7) and (2.8) show that if there are $n_{\text{chan}}^2$ convolutional kernels per network layer, starting with 3D input data with $n_{\text{chan (ini)}}$ channels leads to

\[
n_{\text{chan}} = n_{\text{chan (ini)}} \times 8^{n_{\text{coarsening}}}
\tag{3.14}
\]

channels after coarsening/pooling $n_{\text{coarsening}}$ times. Equivalently, each time the network states coarsen, the number of convolutional kernels grows by a factor of $64$. See Figure 2 for an illustration of this effect for various network designs as a function of input size, number of layers, and the number of coarsening steps. As noted by Peters et al. (2019a); Etmann et al. (2020), this results in prohibitive memory requirements and computation times for storing and computing convolutional kernels, respectively.
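As a worked example of (3.14), the short script below tabulates the channel and kernel counts per coarsening level. The choice of three initial channels is purely illustrative, and the helper names are hypothetical.

```python
def channels_after_coarsening(n_chan_ini, n_coarsening, dim=3):
    """Channel count from (3.14); each coarsening multiplies channels by 2**dim."""
    return n_chan_ini * (2 ** dim) ** n_coarsening

def kernels_per_layer(n_chan):
    """A full layer uses one kernel per (output channel, input channel) pair."""
    return n_chan ** 2

for level in range(4):
    n_chan = channels_after_coarsening(3, level)  # 3 initial channels (illustrative)
    print(level, n_chan, kernels_per_layer(n_chan))
# 0 3 9
# 1 24 576
# 2 192 36864
# 3 1536 2359296
```

After only three coarsening steps, a single full layer in this example already holds more than two million kernels.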

From the above observations and Figure 2, it is clear that invertible hyperbolic networks with multiple coarsening stages can only be a practical tool if it is possible to reduce the memory requirements for the network parameters. Various memory reduction techniques for network weights have been developed, and below we discuss a few that are/are not suitable for our purposes. We also introduce a novel network layer defined by a low block rank.

To start, network compression methods such as pruning and low-rank matrix/tensor factorization (Denton et al., 2014) train a ’full’ network first, followed by compression of the weights. These ideas do not apply to our situation because we assume that it is infeasible to train the full network. Instead, we need methods that directly train a reduced-memory network. Prior work includes replacing a part of the convolutional kernels by scalars (Ephrath et al., 2019), equipping the convolutional kernels with a block-circulant structure for parameter reduction (Ding et al., 2017; Treister et al., 2018), and employing Kronecker-product structures (Wu, 2016).

Here, we introduce another layer type that requires minimal adaptation to existing code, has easily adaptable memory requirements, and has a clear linear-algebraic interpretation. Inspired by techniques like LR-factorization for matrix completion (Rennie and Srebro, 2005; Aravkin et al., 2014), we construct layers with a block-low-rank structure formed by the convolutional kernels when arranged in their flattened matrix form. This structure allows for explicit limitations on the number of convolutions in a layer while preserving a possibly extremely large number of channels. Our approach is similar to ‘squeeze-and-expand’ bottleneck methods like Szegedy et al. (2015); He et al. (2016); Iandola et al. (2016) proposed for non-invertible residual networks, mainly in the context of image classification. Our layer differs because we have one instead of two nonlinearities per layer, and our convolutional kernels have a symmetric structure (see Eq. (2.2)), which SqueezeNet does not have. If the nonlinearity is not symmetric, it will not induce an anti-hermitian operator in 2.1 and, therefore, cannot guarantee stability.

Also different is that we do not rely on $1 \times 1 \times 1$ convolutions for compression and expansion. Because we explicitly induce and recognize the block-low-rank structure in our network layer, we can directly observe various linear-algebraic properties of interest.

Linear algebraic matrix-vector product notation is required to interpret and construct the block-low-rank layer. The computational implementation can still use a 5D tensor format for the numerical examples. First, flatten the tensors that contain network states of type $\mathbb{R}^{n_x \times n_y \times n_z \times n_{\text{chan}}}$ to block-vectors in $\mathbb{R}^{n_x n_y n_z n_{\text{chan}}}$ that contain $n_{\text{chan}}$ sub-vectors $Y^i \in \mathbb{R}^{n_x n_y n_z}$,

\[
\mathbf{Y} \equiv \begin{bmatrix} Y^1 \\ Y^2 \\ \vdots \\ Y^{n_{\text{chan}}} \end{bmatrix}.
\tag{3.15}
\]

Similarly, we rewrite the convolutional kernels for a given layer in tensor format $\mathbb{R}^{n_x \times n_y \times n_z \times n_{\text{chan out}} \times n_{\text{chan in}}}$ as the block matrix $\mathbf{K}$ with $n_{\text{chan out}}$ rows and $n_{\text{chan in}}$ columns. Each block $K(\theta^{i,k})$ is a Toeplitz matrix representation of the convolution with a kernel $\theta^{i,k}$,

\[
\mathbf{K} \equiv \begin{bmatrix}
K(\theta^{1,1}) & K(\theta^{1,2}) & \ldots & K(\theta^{1,n_{\text{chan in}}}) \\
K(\theta^{2,1}) & K(\theta^{2,2}) & \ldots & K(\theta^{2,n_{\text{chan in}}}) \\
\vdots & \vdots & \ddots & \vdots \\
K(\theta^{n_{\text{chan out}},1}) & K(\theta^{n_{\text{chan out}},2}) & \ldots & K(\theta^{n_{\text{chan out}},n_{\text{chan in}}})
\end{bmatrix}.
\tag{3.16}
\]

This block matrix is square if the number of channels remains unchanged after the convolutions. To reduce the number of channels, $\mathbf{K}$ needs to be a flat matrix; a tall block-matrix increases the number of channels. The collection of convolutional kernels at layer $j$ is denoted by $\boldsymbol{\theta}_j$.

Using this notation, the non-linear part of each layer in the hyperbolic network (2) becomes (as an example, consider a case with three input channels in $\mathbf{Y}_j$ but using only six convolutional kernels instead of the usual nine)

\[
\begin{bmatrix}
K(\theta^{1,1})^\top & K(\theta^{2,1})^\top \\
K(\theta^{1,2})^\top & K(\theta^{2,2})^\top \\
K(\theta^{1,3})^\top & K(\theta^{2,3})^\top
\end{bmatrix}
\sigma\Bigg(
\begin{bmatrix}
K(\theta^{1,1}) & K(\theta^{1,2}) & K(\theta^{1,3}) \\
K(\theta^{2,1}) & K(\theta^{2,2}) & K(\theta^{2,3})
\end{bmatrix}
\begin{bmatrix} Y^1 \\ Y^2 \\ Y^3 \end{bmatrix}
\Bigg).
\tag{3.17}
\]

For simplicity, we assumed there is no coarsening via the Haar transform at this particular layer. The symmetric layer structure enables a layer to have the same number of input and output channels while working with fewer convolutional kernels than $n_{\text{chan}}^2$ if $\mathbf{K}$ is a flat block-matrix, i.e., has fewer block-rows than block-columns, as in (3.17).

A benefit of our proposed construction is the following clear linear-algebraic interpretation:

The structure $\mathbf{K}_j^\top \sigma(\mathbf{K}_j \mathbf{Y}_j)$, using a block matrix $\mathbf{K}$ with $m \times n$ blocks in which every block is a convolution matrix, induces several properties, including

  • the number of convolutional kernels in $\mathbf{K}$ is given by $mn$, and the number of convolutions plus transposed convolutions is $2mn$ per layer.

  • the block rank of $\mathbf{K}^\top \mathbf{K}$ is at most $m$.

  • the number of unique kernels in $\mathbf{K}^\top \mathbf{K}$ is at most $(n^2 + n)/2$.
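The properties listed above can be inspected on a small dense example. The sketch below makes several simplifying assumptions that are not part of the method itself: 1D signals, periodic (circular) boundary conditions, a ReLU nonlinearity, and explicitly assembled matrices; `conv_matrix` and `block_matrix` are hypothetical helper names.

```python
import numpy as np

def conv_matrix(kernel, n_pix):
    """Dense matrix of a 1D circular convolution (assumption: periodic boundary)."""
    C = np.zeros((n_pix, n_pix))
    for shift, w in enumerate(kernel):
        C += w * np.roll(np.eye(n_pix), shift - len(kernel) // 2, axis=1)
    return C

def block_matrix(thetas, n_pix):
    """Assemble K from an (m, n_chan) grid of kernels, one convolution block each."""
    return np.block([[conv_matrix(k, n_pix) for k in row] for row in thetas])

rng = np.random.default_rng(0)
n_pix, n_chan, m = 16, 3, 2  # 3 channels and 2 x 3 = 6 kernels instead of 9
thetas = rng.standard_normal((m, n_chan, 3))
K = block_matrix(thetas, n_pix)  # shape (m * n_pix, n_chan * n_pix)

Y = rng.standard_normal(n_chan * n_pix)
out = K.T @ np.maximum(K @ Y, 0.0)  # K^T sigma(K Y) with sigma = ReLU

G = K.T @ K
rank_G = np.linalg.matrix_rank(G)
print(out.shape, rank_G, m * n_pix, np.allclose(G, G.T))
```

The output keeps all `n_chan` channels even though only `m * n_chan` kernels are stored, and the rank of $\mathbf{K}^\top\mathbf{K}$ cannot exceed `m * n_pix`, which is the scalar-matrix counterpart of the block-rank bound.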

Another key observation is that a non-square $\mathbf{K}$ does not affect the invertibility of the network because the rank-deficient matrix $\mathbf{K}^\top \mathbf{K}$ does not need to be inverted for the reverse propagation as defined in (2.9).

Including the symmetric block-low-rank layer extends the applicability of fully invertible hyperbolic networks to larger input data and an increased number of coarsening stages in the network. See Figure 2 for a summary of the memory savings.

3.2 Invertible hyperbolic networks with a different number of input and output channels.

To segment data into $n_{\text{class}}$ classes, one typically employs a neural network that maps the data to the last network state (output) $\mathbf{Y} \in \mathbb{R}^{n_x \times n_y \times n_z \times n_{\text{chan}}}$, where we would choose $n_{\text{class}} = n_{\text{chan}}$. The standard accompanying loss function is the multi-class cross-entropy loss

\[
L(\mathbf{Y}, \mathbf{C}) = -\sum_{(i,j,k)} \sum_{l=1}^{n_{\text{chan}}} \mathbf{C}_{i,j,k,l} \log(\mathbf{Y}_{i,j,k,l}),
\tag{3.18}
\]

where $\mathbf{C} \in \mathbb{R}^{n_x \times n_y \times n_z \times n_{\text{class}}}$ represents the labels, and $i$, $j$, $k$ are spatial indices.

In the case of invertible neural networks, the situation is more complicated. The standard form of the fully invertible hyperbolic network outputs the same tensor size as the input, i.e., the number of input and output channels is equal. This is limiting because the data rarely has the same number of input channels as output classes. Fortunately, invertible networks can still be used with a minor modification to the loss function.

Consider data with $n_{\text{chan}}$ input channels, so there are also $n_{\text{chan}}$ output channels, assuming the network contains as many forward as inverse transformations. To train a network for segmenting the data into $n_{\text{class}}$ classes, we need $n_{\text{chan}} \geq n_{\text{class}}$. To deal with the mismatch in channel count and the number of classes, we can compute the partial cross-entropy loss

\[
L(\mathbf{Y}, \mathbf{C}) = -\sum_{(i,j,k)} \sum_{l=1}^{n_{\text{class}}} \mathbf{C}_{i,j,k,l} \log(\mathbf{Y}_{i,j,k,l}).
\tag{3.19}
\]

This loss function sums only over the $n_{\text{class}}$ ‘active’ channels that correspond to labels. The other channels also produce an output, which is not used to compute a loss or gradient. These channels are still required to reverse the direction of the network, i.e., all channels are required to generate input data from a prediction (and the second to last layer).
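A minimal sketch of the partial loss (3.19) is given below. It assumes that the network output is converted to class probabilities by a softmax over the active channels only, which the text does not specify, and that the labels are one-hot encoded; the function name is hypothetical.

```python
import numpy as np

def partial_cross_entropy(Y, C, n_class):
    """Cross-entropy (3.19) summed over the first n_class channels only.

    Y: network output, shape (nx, ny, nz, n_chan) with n_chan >= n_class.
    C: one-hot labels, shape (nx, ny, nz, n_class).
    """
    logits = Y[..., :n_class]  # 'active' channels; the rest are ignored
    # Assumption: softmax over the active channels, computed stably in log space.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.sum(C * log_prob)

rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 4, 4, 8))          # 8 output channels
labels = rng.integers(0, 3, size=(4, 4, 4))    # 3 classes
C = np.eye(3)[labels]
loss = partial_cross_entropy(Y, C, n_class=3)
```

Changing the values in channels 4 to 8 of `Y` leaves `loss` unchanged, so those channels receive a zero gradient from the loss itself while still being available for the reverse propagation.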

The above assumed that $n_{\text{chan}} \geq n_{\text{class}}$, which is generally not the case. The number of input channels can be increased by

  • duplicating (some of) the channels of the input data. This is a valid strategy, but it increases the total memory load as each layer contains more channels. It follows that the input data size grows linearly with the number of duplicated input data channels.

  • transforming the data into more channels while preserving the information. We can achieve this by taking a Haar transform (or another orthogonal transform) that reduces the data resolution and increases the number of channels while keeping the memory load constant. Applying the transform to data $\mathbf{Y}$ changes the sizes according to $\mathbf{W}\mathbf{Y}: \mathbb{R}^{n_x \times n_y \times n_z \times n_{\text{chan}}} \rightarrow \mathbb{R}^{n_x/2 \times n_y/2 \times n_z/2 \times 8 n_{\text{chan}}}$. Multiple transforms applied in sequence further increase the number of channels if required. To emphasize, this transform is applied to the data before it enters the network. It is, therefore, a one-time operation and does not add significant computational training time. This option is more desirable in terms of memory.
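The second option can be sketched with a single-level orthogonal Haar transform that moves each $2 \times 2 \times 2$ neighborhood into the channel dimension. The implementation below is one possible realization under stated assumptions (even spatial sizes, channels stored in the last axis, a particular ordering of the new channels); the function names are hypothetical.

```python
import numpy as np

def haar_axis(x, axis):
    """One Haar level along `axis`; low/high-pass parts are stacked as channels."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    lo, hi = (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)
    return np.concatenate([lo, hi], axis=-1)

def haar_axis_inverse(x, axis):
    """Undo haar_axis: recover even/odd samples and interleave them along `axis`."""
    lo, hi = np.split(x, 2, axis=-1)
    even, odd = (lo + hi) / np.sqrt(2.0), (lo - hi) / np.sqrt(2.0)
    stacked = np.stack([even, odd], axis=axis + 1)
    shape = list(even.shape)
    shape[axis] *= 2
    return stacked.reshape(shape)

def haar3d(x):
    for axis in (0, 1, 2):
        x = haar_axis(x, axis)
    return x

def haar3d_inverse(x):
    for axis in (2, 1, 0):
        x = haar_axis_inverse(x, axis)
    return x

X = np.random.default_rng(0).standard_normal((8, 8, 8, 2))
WX = haar3d(X)
print(X.shape, "->", WX.shape)  # (8, 8, 8, 2) -> (4, 4, 4, 16)
```

The number of entries and the norm of the data are unchanged by this transform, and `haar3d_inverse(WX)` recovers `X` up to round-off.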

Suppose one is willing to give up invertibility. In that case, we can simply add a non-square linear operator to map the output of the invertible network to the desired number of output channels for the loss, albeit at a higher memory cost and loss of stability guarantees as in Thm. 2.1.

3.3 Tasks with different input and output dimensions.

Applications like hyperspectral land-use segmentation intrinsically reduce the dimensionality between input data and output by collapsing the frequency axis into a point, i.e., $\mathbb{R}^{n_x \times n_y \times n_{\text{freq}}} \rightarrow \mathbb{R}^{n_x \times n_y}$. In the case of time-lapse hyperspectral segmentation, there is also a reduction along the time axis as $\mathbb{R}^{n_x \times n_y \times n_{\text{freq}} \times n_t} \rightarrow \mathbb{R}^{n_x \times n_y}$.

The above tasks seem incompatible with the fully invertible hyperbolic neural network that outputs a tensor the same size as the input. This time, we propose to measure the loss over a single slice in the output tensor. That is, embed the known ground-truth labels that depend on the class and spatial coordinates $x$ and $y$ in a larger label tensor $\mathbf{C} \in \mathbb{R}^{n_x \times n_y \times n_{\text{freq}} \times n_{\text{chan}}}$ at slice $p$ as $\mathbf{C}_{:,:,p,1:n_{\text{class}}}$. All other entries in the label tensor do not exist and do not contribute to the loss or its gradient computation. See Peters et al. (2019b, c) for more information and applications of partial loss functions. The resulting multi-class cross-entropy function reads

$$L(\mathbf{Y},\mathbf{C}) = -\sum_{(i,j)} \sum_{l=1}^{n_{\text{class}}} \mathbf{C}_{i,j,p,l} \log(\mathbf{Y}_{i,j,p,l}), \tag{3.20}$$

where $p$ is the fixed tensor-slice index that contains the labels. When only sparse spatial location indices of known labels are available, the sum over $(i,j)$ reduces to a sum over the subset of labeled pixels.
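As an illustration of the partial loss (3.20), the following NumPy sketch evaluates the cross-entropy only on slice $p$, only on the first $n_{\text{class}}$ channels, and only at labeled pixels. The function name, the argument layout, and the softmax applied to raw scores are assumptions made for this example; they are not taken from the reference implementation.

```python
import numpy as np

def partial_cross_entropy(Y, labels, p, labeled_idx, n_class):
    """Cross-entropy restricted to slice p and to the labeled pixels.

    Y           : (nx, ny, nfreq, nchan) network output, treated as raw scores (assumption)
    labels      : (nx, ny) integer class map with values in [0, n_class)
    p           : index of the slice that carries the embedded labels
    labeled_idx : list of (i, j) pixel locations with known labels
    """
    i, j = np.asarray(labeled_idx).T                  # row and column indices of labeled pixels
    scores = Y[i, j, p, :n_class]                     # (n_labeled, n_class); all other entries are ignored
    scores = scores - scores.max(axis=1, keepdims=True)                   # stabilize the softmax
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # With one-hot labels, the inner sum over classes picks the log-probability of the true class.
    return -log_prob[np.arange(len(i)), labels[i, j]].sum()

rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 5, 4, 3))                 # toy output tensor
labels = rng.integers(0, 2, size=(6, 5))              # toy two-class label map
idx = [(0, 0), (2, 3), (5, 4)]                        # sparse point annotations
loss = partial_cross_entropy(Y, labels, p=1, labeled_idx=idx, n_class=2)
```

Because entries outside the selected slice, channels, and pixels never enter the computation, they also receive a zero gradient from this loss.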

3.4 Resolution changes between input and output.

Some applications require an output resolution that differs from the input resolution. Once more, straightforward application of the fully invertible hyperbolic network cannot accomplish resolution changes between input and output because the input and output tensors have the same size. It turns out that there is still a way to train an invertible network such that input and output have different resolutions. To start, consider a single network layer of an invertible hyperbolic network that decreases the resolution via

$$\mathbf{Y}_3 = 2\mathbf{W}\mathbf{Y}_2 - \mathbf{W}\mathbf{Y}_1 - h^2 \mathbf{K}_3^\top \sigma(\mathbf{K}_3 \mathbf{W}\mathbf{Y}_2),$$

where $\mathbf{W}$ is an orthogonal transform that transforms and reorganizes the 4D input as $\mathbf{W}\mathbf{Y}: \mathbb{R}^{n_1 \times n_2 \times n_3 \times n_{\text{chan}}} \rightarrow \mathbb{R}^{n_1/2 \times n_2/2 \times n_3/2 \times 8 n_{\text{chan}}}$. Note that the first output channel of the single-level Haar transform is the input on a resolution reduced by a factor of two. The same holds for other transforms, such as the pixel-shuffle transform. So as long as the network contains more forward transforms than inverse transforms, the output resolution is lowered by a factor of two for every surplus forward transform. Network training can utilize this concept by defining the loss function over a particular selection of output channels. Similar logic applies to resolution increases.
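The coarsening layer can be sketched in NumPy as follows. The sketch assumes that $\mathbf{W}$ is a single-level orthogonal 3D Haar transform, that $\sigma$ is a ReLU, and that $\mathbf{K}_3$ is a dense channel-mixing matrix standing in for a convolution; all function names are hypothetical and chosen for this example only.

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)   # orthogonal, symmetric 2x2 Haar matrix

def haar3d_forward(Y):
    """(n1, n2, n3, c) -> (n1/2, n2/2, n3/2, 8c); n1, n2, n3 must be even."""
    n1, n2, n3, c = Y.shape
    B = Y.reshape(n1 // 2, 2, n2 // 2, 2, n3 // 2, 2, c)    # split every axis into (coarse, fine)
    Z = np.einsum("ai,bj,dk,xiyjzkc->xyzabdc", H, H, H, B)  # mix each 2x2x2 block into 8 sub-bands
    return Z.reshape(n1 // 2, n2 // 2, n3 // 2, 8 * c)

def haar3d_inverse(Z):
    """Inverse of haar3d_forward; orthogonality means the inverse is the transpose."""
    m1, m2, m3, c8 = Z.shape
    B = Z.reshape(m1, m2, m3, 2, 2, 2, c8 // 8)
    Y = np.einsum("ai,bj,dk,xyzabdc->xiyjzkc", H, H, H, B)  # applies H^T along each sub-band axis
    return Y.reshape(2 * m1, 2 * m2, 2 * m3, c8 // 8)

def coarsening_leapfrog(Y1, Y2, K, h):
    """Y3 = 2 W Y2 - W Y1 - h^2 K^T sigma(K W Y2), with sigma = ReLU (assumption)."""
    WY1, WY2 = haar3d_forward(Y1), haar3d_forward(Y2)
    return 2.0 * WY2 - WY1 - h**2 * np.maximum(WY2 @ K.T, 0.0) @ K

def coarsening_leapfrog_inverse(Y2, Y3, K, h):
    """Solve the layer equation for W Y1, then undo W to recover Y1."""
    WY2 = haar3d_forward(Y2)
    WY1 = 2.0 * WY2 - Y3 - h**2 * np.maximum(WY2 @ K.T, 0.0) @ K
    return haar3d_inverse(WY1)

rng = np.random.default_rng(0)
Y1 = rng.standard_normal((8, 6, 4, 3))
Y2 = rng.standard_normal((8, 6, 4, 3))
K = rng.standard_normal((5, 24))           # toy stand-in: 24 channels -> 5 channels
Z = haar3d_forward(Y2)                     # shape (4, 3, 2, 24)
Y3 = coarsening_leapfrog(Y1, Y2, K, h=0.1)
Y1_rec = coarsening_leapfrog_inverse(Y2, Y3, K, h=0.1)
```

In this sketch the first `c` output channels of `haar3d_forward` equal the 2x2x2 block average of the input scaled by $\sqrt{8}$, which is the half-resolution copy of the input that a loss on selected output channels can exploit.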

4 Examples

The foundation of the following experiments is the fully invertible hyperbolic network, implemented by Witte et al. (2020) and Orozco et al. (2023). Its specialization to geoscientific and remote sensing problems is available at https://github.com/PetersBas/FHN_Examples. The experiments illustrate that the proposed techniques enable the application of fully invertible hyperbolic networks to various large-scale geoscience problems while obtaining satisfactory results. Table 1 summarizes the memory requirements for the states and convolutional kernels, not including memory allocations for intermediate computations, Lagrangian multipliers for gradient computations, and the gradient itself. The table shows that both invertibility and block-low-rank layers are required to run examples on a standard GPU, like a 24 GB NVIDIA GeForce RTX 3090. Figure 2 shows the general memory scaling of our approach. In order to train a non-invertible network with the same number of channels per layer and the same input data size, the number of layers needs to be reduced to keep the memory footprint manageable.

4.1 Time-lapse hyperspectral land-use change detection

The data $\mathbf{X} \in \mathbb{R}^{n_x \times n_y \times n_{\text{freq}} \times n_t}$ has two spatial dimensions of sizes $n_x$ and $n_y$, the third dimension corresponds to frequency, and there is one channel per time of data collection (two in this example). Figure 4 displays this data set (Hasanlou and Seydi, 2018). We follow common practice in hyperspectral imaging literature, where part of the segmentation is assumed known. The red and white dots in Figure 5 show the 70 known label locations; training uses 50 of these annotations, and validation uses the remaining 20.

Figure 4: Hyperspectral data collected at two different times.
Figure 5: (a) Plan view of all true labels with point-annotation locations for training and validation overlaid, (b) prediction, and (c) error map. Most errors are boundary effects, and just a few farm fields are identified as changed/not-changed incorrectly (red arrows highlight two examples).
Figure 6: Validation losses for the hyperspectral example, for the proposed network and three different multi-level ResNets.

This example aims to predict the land-use change on a coarser grid. Table 2 lists the details of the network, which contains one more forward Haar transform than inverse Haar transforms, so the output feature resolution decreases by a factor of two. We use stochastic gradient descent with momentum and a decaying learning rate for 70 iterations to minimize the loss function (3.20). The loss is measured over a few slices of the output tensor that embeds the labels. Figure 5 shows true land-use change, prediction, and errors. Aside from some boundary artifacts, a few farm fields were classified incorrectly. The low-memory nature of the invertible network enabled us to input the entire 4D data in one chunk; see also Table 1 for details regarding memory usage. We also show the validation loss in Figure 6, as well as the validation losses for three comparison multi-level ResNets (see the Appendix for the network details). Because of the memory constraints, the ResNets need to be significantly shorter than the invertible network with BLR layers. The losses for the various ResNets are comparable to each other but do not approach the loss of the proposed network.

4.2 Regional-scale aquifer mapping

Figure 7: The data inputs for the aquifer mapping example. Each type is placed in a separate channel of the input. We do not use the two geological maps as images. Instead, each class is converted to a map with zero/one values, resulting in 52 separate geological maps.

The task is to delineate large aquifers in Arizona, USA; see Figure 9. The two classes are 1) basin and range / Colorado Plateau aquifer; 2) no aquifer (Robson and Banta, 1995). The survey area is most of the state. Aircraft and satellite-based sensors collected magnetic data, two types of gravity measurements, and the topography; see Figure 7. Besides these remotely acquired data, we add two types of geological maps: one in terms of rock age and one in terms of rock type. The advantage of using geological maps is that they incorporate expert knowledge into our data. Geologists construct these maps by synthesizing their geological knowledge with ground truth observations, hyperspectral data, and various airborne and land-based geophysical surveys. Because the class numbers in the geological maps in Figure 7 are arbitrary, and a map used as an image is not invariant under a permutation of these numbers, we use their one-hot encodings (52 separate maps).
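The one-hot conversion of a categorical map into zero/one channels can be sketched as below. The map, the class count, and the function name are toy values invented for illustration; the actual geological maps contribute 52 channels in total.

```python
import numpy as np

def one_hot_channels(class_map, n_class):
    """(nx, ny) integer class map -> (nx, ny, n_class) stack of zero/one maps."""
    return (class_map[..., None] == np.arange(n_class)).astype(np.float32)

rock_type = np.array([[0, 2, 1],
                      [1, 1, 0]])                 # toy categorical map with 3 classes
H_type = one_hot_channels(rock_type, 3)           # shape (2, 3, 3)

# Renumbering the classes only reorders the channels; the zero/one maps themselves are unchanged.
perm = np.array([2, 0, 1])                        # hypothetical relabeling: old class k -> perm[k]
H_renumbered = one_hot_channels(perm[rock_type], 3)

# The binary maps are stacked with the continuous data along the channel axis.
gravity = np.zeros((2, 3, 1), dtype=np.float32)   # placeholder for one continuous data channel
network_input = np.concatenate([gravity, H_type], axis=-1)   # shape (2, 3, 4)
```

Feeding the integer map directly would instead impose an ordering and distance between rock classes that has no geological meaning.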

Table 2: Network designs for the fully invertible networks.

Hyperspectral
Layer    Channels    Block rank    Feature size
1-6      16          16            368 × 288 × 184
7-18     128         16            184 × 144 × 92

Aquifer mapping
Layer    Channels    Block rank    Feature size
1-4      112         24            848 × 1456
5-7      448         24            424 × 728
8-10     1792        24            212 × 364
11-28    7168        24            106 × 182
29-32    1792        24            212 × 364
32-34    448         24            424 × 728
35-39    112         24            848 × 1456
Figure 8: Locations of the training and validation labels for the aquifer mapping example.

The experimental setting assumes an expert annotated the aquifers in a few patches (see Figure 8), so the neural network assists a domain expert by interpolating and extrapolating limited annotation. Training is similar to the previous example: SGD with momentum and a decaying learning rate for 140 iterations to minimize the multi-class cross-entropy loss. Each iteration uses about 10% of the known labels (randomly selected) to compute an approximation of the loss and the gradient. We also augment the data with random flips and permutations. The network details can be found in Table 2. The 39-layer fully invertible hyperbolic network uses three coarsening stages (see Table 2) to increase the receptive field size and enable information to propagate over larger spatial distances. Figure 9 displays the results and errors. Most of the errors are concentrated along some of the geological rock-type boundaries.
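The per-iteration label subsampling and flip augmentation can be sketched as follows. This is a minimal illustration with hypothetical helper names and toy sizes; the permutation augmentation is left out because the text does not specify which quantities are permuted.

```python
import numpy as np

def sample_labels(labeled_idx, fraction, rng):
    """Pick a random subset (without replacement) of the labeled pixel locations."""
    labeled_idx = np.asarray(labeled_idx)
    n_keep = max(1, int(round(fraction * len(labeled_idx))))
    keep = rng.choice(len(labeled_idx), size=n_keep, replace=False)
    return labeled_idx[keep]

def random_flip(X, labeled_idx, rng):
    """Flip the data X (nx, ny, nchan) and the label locations consistently."""
    nx, ny = X.shape[:2]
    idx = np.array(labeled_idx, copy=True)
    if rng.random() < 0.5:                 # flip along the first spatial axis
        X = X[::-1]
        idx[:, 0] = nx - 1 - idx[:, 0]
    if rng.random() < 0.5:                 # flip along the second spatial axis
        X = X[:, ::-1]
        idx[:, 1] = ny - 1 - idx[:, 1]
    return X, idx

rng = np.random.default_rng(1)
X = rng.standard_normal((7, 9, 2))                                       # toy multi-channel map
all_idx = [(i, j) for i in range(7) for j in range(9) if (i + j) % 3 == 0]   # 21 toy annotations
batch = sample_labels(all_idx, fraction=0.10, rng=rng)                   # about 10% of the labels
X_aug, batch_aug = random_flip(X, batch, rng)
```

The loss for one iteration would then be evaluated only at the locations in `batch_aug`, which gives a stochastic approximation of the loss over all known labels.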

This example showed that neural networks could assist domain experts and mimic their work for aquifer mapping. Invertible networks can deal with such large computational domains with many input channels in one chunk. The invertibility reduced the memory required for storing the network states from 21.02 GB to just 1.66 GB; see Table 1. For this example with many input channels and multiple coarsening stages, training the parameters of the network in a compressed/factorized form directly used just 0.12 GB for storing the convolutional kernels, instead of the 32.19 GB that a standard invertible hyperbolic network would require to store the very large number of convolutional kernels in this particular design; see Table 1.
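The state-memory figures quoted above can be reproduced with simple arithmetic under a few assumptions that are not stated in this section: 4-byte (single-precision) values, decimal gigabytes, 38 stored states for the non-invertible case, and 3 stored states for the invertible case. The two state counts are inferred because they match the quoted numbers; Table 1 remains the authoritative source.

```python
BYTES_PER_VALUE = 4          # assumption: single-precision floats

def elements(feature_size, channels):
    """Number of entries in one network state."""
    n = channels
    for s in feature_size:
        n *= s
    return n

def states_gb(feature_size, channels, n_states):
    """Memory for n_states stored states, in decimal gigabytes."""
    return elements(feature_size, channels) * BYTES_PER_VALUE * n_states / 1e9

# Orthogonal coarsening trades resolution for channels, so every level holds the same number of entries.
same_size = elements((848, 1456), 112) == elements((106, 182), 7168)

all_states_gb = states_gb((848, 1456), 112, n_states=38)    # assumed count when every state is kept
three_states_gb = states_gb((848, 1456), 112, n_states=3)   # assumed count for the invertible network

print(f"{all_states_gb:.2f} GB vs {three_states_gb:.2f} GB")   # prints "21.02 GB vs 1.66 GB"
```

Under these assumptions the invertible network's state memory does not depend on depth, whereas the non-invertible figure grows by roughly 0.55 GB for every additional layer.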

Figure 9: True aquifer map, prediction, and difference.

4.3 3D interpolation-segmentation of a seismic image volume from borehole information.

Building a 3D geological model from seismic imaging means grouping several layers, structures, or geological units to obtain a simplified geological model that conveys the information of interest. The interpretation of seismic volumes is challenging due to imaging artifacts (data noise, violation of assumptions in the imaging algorithm, poor illumination from seismic waves), spatial variation in the appearance of the interfaces, discontinuities of the interface due to geological faults, and a lack of ground-truth away from boreholes.

Here, we present an experiment aiming to segment the full 3D seismic volume (Figure 10) from borehole information. We assume that a small area around each borehole was interpreted by an expert and can serve as training and validation labels (Figure 10). Interpreting the seismic image close to boreholes is relatively easy due to the proximity to the ground truth.

Figure 10: (a) Full 3D data; training uses sub-cubes. (b) Fully labeled data volume; training and validation use small parts of the labels, as indicated by (c), a plan view of the computational domain with the training/validation locations highlighted (small areas near borehole locations). The remainder of the labels are used for testing purposes only.

While seismic interpretation from borehole information is nothing new, most work maps 2D image-to-image. Some 3D seismic interpretation approaches operate 3D-to-3D but are limited to training with relatively small 3D sub-cubes due to memory limitations and use sub-cubes of up to $128^3$ (Wu et al., 2019; Zhao, 2019; Shi et al., 2018; Wang and Nealon, 2019; Gao et al., 2021; Dou et al., 2022). Using larger 3D sub-cubes enables learning larger-scale structures present in the data. Peters and Haber (2020) show the first 3D interpretation approach using an invertible network and use an input size of up to $192 \times 192 \times 288 \times 3$. Here, we show results from training on the largest inputs to date, which are about $5.7\times$ larger than previous work.

The full data has a size of $401 \times 701 \times 248$. For training, we replicate the data into 12 channels and input to the network a randomly selected sub-cube of size $248 \times 248 \times 248 \times 12$. Table 3 lists the network details. We reduce the cross-entropy loss using the ADAM optimizer for 240 iterations with a decaying stepsize. Each iteration selects a randomly located sub-cube. Table 4 reports the intersection over union (IoU) of the final results. Figure 11 shows 2D cross-sections from the 3D volume, accompanied by the true labels. The final prediction results from inference on the full data volume split into a couple of overlapping pieces. This example shows that we can obtain good segmentation results from partial labeling while training on large 3D input sub-cubes. The size of the sub-cubes is important because selecting only small sub-cubes would lead to most cubes containing no labels. Larger sub-cubes can connect more of the data to labeled locations.
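The sub-cube sampling can be sketched as follows. The helper name and the toy sizes are assumptions for illustration (the real volume is $401 \times 701 \times 248$ with $248^3$ sub-cubes, which would be needlessly large for a demonstration); the last two lines check the input-size ratio against the earlier $192 \times 192 \times 288 \times 3$ input.

```python
import numpy as np

def random_subcube(volume, size, n_chan, rng):
    """Cut a randomly located cube from a 3D volume and replicate it into n_chan channels."""
    starts = [int(rng.integers(0, n - s + 1)) for n, s in zip(volume.shape, size)]
    window = tuple(slice(a, a + s) for a, s in zip(starts, size))
    cube = np.repeat(volume[window][..., None], n_chan, axis=-1)
    return cube, starts

rng = np.random.default_rng(0)
volume = rng.standard_normal((10, 14, 6))            # toy stand-in for the seismic volume
cube, starts = random_subcube(volume, size=(6, 6, 6), n_chan=12, rng=rng)

# Input-size comparison with the earlier invertible-network result.
this_work = 248**3 * 12
previous = 192 * 192 * 288 * 3
ratio = this_work / previous                          # about 5.7
```

When the sub-cube spans a full axis, as the third axis does here and in the real data, the start index along that axis is always zero, so the randomness is confined to the two horizontal directions.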

Table 3: Network design for the fully invertible network for the 3D seismic interpretation example.
Layer    Block rank    Channels    Feature size
1-2      8             12          248 × 248 × 248
3-5      16            96          124 × 124 × 124
6-8      32            768         62 × 62 × 62
9-18     32            6144        31 × 31 × 31
19-21    32            768         62 × 62 × 62
22-24    16            96          124 × 124 × 124
25-30    8             12          248 × 248 × 248

Table 1 displays the memory usage of our network and equivalent standard invertible and non-invertible networks. The table also shows that a direct comparison with the equivalent non-invertible version of the hyperbolic network without block-low-rank layers is not possible on a standard $\approx 24$ GB GPU. Instead, we evaluated two closely related networks that just fit on the GPU. These networks are similar to ours (Table 3), except that they have one level fewer and are a few layers shorter. See the Appendix for network details and Figure 13 for prediction images. Table 4 contains the statistics in terms of intersection over union (IoU), which show that using one level fewer or reducing the number of layers to make the network fit and train on a GPU comes at the cost of a significant drop in IoU score.

Table 4: Validation IoU values for the 3D seismic example. The invertible + BLR layer network (design in Table 3) follows Equation (2). The memory requirements for this network and its non-invertible equivalent without block-low-rank layers are shown in Table 1. See the Appendix for the design details of networks (a) and (b).
Network type                                   Validation IoU (class 1 / class 2)
Invertible + BLR layers (ours)                 0.97 / 0.96
Non-invertible equivalent, no BLR layers       out of memory
Largest non-invertible related network (a)     0.91 / 0.85
Largest non-invertible related network (b)     0.92 / 0.88
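For reference, a per-class intersection over union can be computed as in the sketch below. The function name, the optional mask that restricts scoring to validation pixels, and the zero-based class numbering are choices made for this example, not details taken from the experiments.

```python
import numpy as np

def iou_per_class(pred, truth, n_class, mask=None):
    """Intersection over union for each class; mask selects the pixels that are scored."""
    if mask is None:
        mask = np.ones(truth.shape, dtype=bool)
    scores = []
    for k in range(n_class):
        p = (pred == k) & mask              # pixels predicted as class k
        t = (truth == k) & mask             # pixels labeled as class k
        union = np.logical_or(p, t).sum()
        scores.append(np.logical_and(p, t).sum() / union if union > 0 else float("nan"))
    return scores

pred = np.array([[0, 0],
                 [1, 1]])
truth = np.array([[0, 1],
                  [1, 1]])
scores = iou_per_class(pred, truth, n_class=2)        # [1/2, 2/3]

top_row = np.array([[True, True],
                    [False, False]])
scores_top = iou_per_class(pred, truth, n_class=2, mask=top_row)   # [1/2, 0]
```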

This example also showed that multi-level, fully invertible hyperbolic networks are suitable for 3D seismic segmentation and that the more than six thousand channels are no issue if we use symmetric layers with a low block-rank.

Figure 11: Three orthogonal cross-sections from the final prediction and the true labels. Training label locations are shaded.

4.4 Effect of the selection of the maximum block-rank

The previous examples utilize a block rank of $\mathbf{K}^\top \mathbf{K}$ that is much below full rank. Naturally, questions arise about selecting the block rank and determining the sensitivity of the network performance to the block rank.

The maximum rank we can practically select is limited by the available memory to store convolutional kernels, or by the computational time that is available. Figure 12 provides a more quantitative and experimental answer. The figure shows all experiments repeated for various values of the maximum block-rank and averaged over five random initializations of the network parameters. The conclusion is that a very low block rank generally comes at the cost of slightly reduced prediction quality, as measured using intersection over union. A high block rank can also slightly reduce the prediction quality because it generates less implicit regularization (no other forms of regularization were used in the numerical experiments). We finalize the experimental evaluation by noting that the IoU results show clear trends, but the assessment of the geoscientific and remote sensing examples remains challenging because the seismic, aquifer, and hyperspectral labels are expert interpretations of the data, and not ground truth. The aquifer mapping example is based on various data sources, including ones not fed into the network. All of these come with some ambiguity.
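The link between block rank and kernel memory can be illustrated with a simplified stand-in in which every convolutional kernel is a single scalar, so that $\mathbf{K}$ becomes an ordinary $r \times c$ matrix. The precise layer definition is given earlier in the paper; the sizes and the helper below are assumptions for this sketch only.

```python
import numpy as np

rng = np.random.default_rng(0)
c, r = 64, 8                          # toy channel count and maximum block rank
K = rng.standard_normal((r, c))       # each entry stands in for one convolutional kernel
G = K.T @ K                           # c x c operator applied by the symmetric layer
rank = np.linalg.matrix_rank(G)       # at most r, far below the full rank c

def kernel_count(channels, block_rank=None):
    """Number of kernels stored in K: block_rank x channels, or channels x channels if uncompressed."""
    rows = channels if block_rank is None else min(block_rank, channels)
    return rows * channels

# Coarsest level of Table 3: 6144 channels with block rank 32.
saving = kernel_count(6144) // kernel_count(6144, block_rank=32)   # 192
```

Storing only the factor $\mathbf{K}$ makes the kernel memory grow linearly with the channel count for a fixed block rank, rather than quadratically.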

Figure 12: Plots of the IoU versus the selected block-rank.

5 Conclusions

The high memory requirements for training a deep neural network using automatic differentiation are a critical issue that limits the application to large input-data blocks like hyperspectral data, very large-scale multi-modality 2D geoscientific maps and airborne-remote sensing, as well as 3D seismic imagery. Fully invertible neural networks mostly solve the memory requirement issues for network states and achieve a constant memory footprint independent of the number of layers and pooling/coarsening stages inside the network.

This work takes a closer look at the fully invertible hyperbolic network based on a conservative leapfrog discretization of the non-linear hyperbolic telegraph equation. Problems are uncovered that prevent the direct application of fully invertible hyperbolic networks to tasks that require multiple coarsening/pooling stages in the network, applications that reduce a 3D/4D tensor into a 2D map of the earth, tasks with different numbers of input and output channels, and applications with resolution changes. For each of these issues, we provided a solution that enables the application of the network without fundamentally altering its design.

We introduce a layer design where the matrix representation of the convolutional kernels has a low block-rank structure. This design changes the exponential growth of the memory for convolutional kernels as a function of the number of pooling stages into a tuneable memory footprint. Changing the number of channels, dimensions, and resolution between input and output is enabled by embedding the labels into larger tensors in combination with particular ways to measure the loss, and using a different number of forward and inverse orthogonal transforms inside the network.

Examples illustrate how to apply fully invertible networks to time-lapse hyperspectral data with a resolution change, very large-scale multi-modal remote sensing for sub-surface aquifer mapping, and 3D seismic interpretation. The tools developed in this work enable invertible networks to be applied to these problems and thus learn from larger input blocks of data, which in turn enables the network to learn from larger-scale structures present in the data.

References

  • Kervadec et al. [2019] Hoel Kervadec, Jose Dolz, Meng Tang, Eric Granger, Yuri Boykov, and Ismail Ben Ayed. Constrained-cnn losses for weakly supervised segmentation. Medical Image Analysis, 54:88–99, 2019. ISSN 1361-8415. doi:https://doi.org/10.1016/j.media.2019.02.009. URL https://www.sciencedirect.com/science/article/pii/S1361841518306145.
  • Peters [2022] Bas Peters. Point-to-set distance functions for output-constrained neural networks. J. Appl. Numer. Optim, 4(2):175–201, 2022.
  • Jia et al. [2021] Fan Jia, Jun Liu, and Xue-Cheng Tai. A regularized convolutional neural network for semantic image segmentation. Analysis and Applications, 19(01):147–165, 2021. doi:10.1142/S0219530519410148. URL https://doi.org/10.1142/S0219530519410148.
  • Lensink et al. [2022] Keegan Lensink, Bas Peters, and Eldad Haber. Fully hyperbolic convolutional neural networks. Research in the Mathematical Sciences, 9(4):1–22, 2022.
  • Ruthotto and Haber [2018] Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, pages 1–13, 2018.
  • Behrmann et al. [2019] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Joern-Henrik Jacobsen. Invertible residual networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 573–582. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/behrmann19a.html.
  • Etmann et al. [2020] Christian Etmann, Rihuan Ke, and Carola-Bibiane Schönlieb. iunets: Learnable invertible up- and downsampling for large-scale inverse problems. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2020. doi:10.1109/MLSP49062.2020.9231874.
  • Jacobsen et al. [2018] Jörn-Henrik Jacobsen, Arnold W.M. Smeulders, and Edouard Oyallon. i-revnet: Deep invertible networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJsjkMb0Z.
  • van de Leemput et al. [2019] Sil C van de Leemput, Jonas Teuwen, Bram van Ginneken, and Rashindra Manniesing. Memcnn: A python/pytorch package for creating memory-efficient invertible neural networks. Journal of Open Source Software, 4(39):1576, 2019.
  • Chang et al. [2018] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 32, 2018. doi:10.1609/aaai.v32i1.11668. URL https://ojs.aaai.org/index.php/AAAI/article/view/11668.
  • Gomez et al. [2017] Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Adv Neural Inf Process Syst, pages 2211–2221, 2017.
  • Dinh et al. [2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. CoRR, abs/1605.08803, 2016. URL http://arxiv.org/abs/1605.08803.
  • Peters et al. [2019a] Bas Peters, Eldad Haber, and Keegan Lensink. Symmetric block-low-rank layers for fully reversible multilevel neural networks. arXiv preprint arXiv:1912.12137, 2019a.
  • Zhou and Luo [2018] Yanjie Zhou and Zhendong Luo. A crank–nicolson collocation spectral method for the two-dimensional telegraph equations. Journal of Inequalities and Applications, 2018:1–17, 2018.
  • Haber and Ruthotto [2017] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, dec 2017. doi:10.1088/1361-6420/aa9a90.
  • LeVeque [1990] R.J. LeVeque. Numerical Methods for Conservation Laws. Birkhauser, 1990.
  • Truchetet and Laligant [2004] Frederic Truchetet and Olivier Laligant. Wavelets in industrial applications: a review. Wavelet Applications in Industrial Processing II, 5607, 2004. doi:10.1117/12.580395. URL https://doi.org/10.1117/12.580395.
  • Evans [2010] L.C. Evans. Partial Differential Equations. Graduate studies in mathematics. American Mathematical Society, 2010. ISBN 9780821849743. URL https://books.google.ca/books?id=Xnu0o_EJrCQC.
  • Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.
  • Denton et al. [2014] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1269–1277. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5544-exploiting-linear-structure-within-convolutional-networks-for-efficient-evaluation.pdf.
  • Ephrath et al. [2019] Jonathan Ephrath, Lars Ruthotto, Eldad Haber, and Eran Treister. Leanresnet: A low-cost yet effective convolutional residual networks. arXiv preprint arXiv:1904.06952, 2019.
  • Ding et al. [2017] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, X. Ma, Y. Zhang, J. Tang, Q. Qiu, X. Lin, and B. Yuan. Circnn: Accelerating and compressing deep neural networks using block-circulant weight matrices. In 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 395–408, Oct 2017.
  • Treister et al. [2018] Eran Treister, Lars Ruthotto, Michal Sharoni, Sapir Zafrani, and Eldad Haber. Low-cost parameterizations of deep convolutional neural networks. arXiv preprint arXiv:1805.07821, 2018.
  • Wu [2016] Jia-Nan Wu. Compression of fully-connected layer in neural network by kronecker product. In 2016 Eighth International Conference on Advanced Computational Intelligence (ICACI), pages 173–179, 2016. doi:10.1109/ICACI.2016.7449822.
  • Rennie and Srebro [2005] Jasson DM Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713–719. ACM, 2005.
  • Aravkin et al. [2014] Aleksandr. Aravkin, Rajiv. Kumar, Hassan. Mansour, Ben. Recht, and Felix J. Herrmann. Fast methods for denoising matrix completion formulations, with applications to robust seismic data interpolation. SIAM Journal on Scientific Computing, 36(5):S237–S266, 2014. doi:10.1137/130919210. URL https://doi.org/10.1137/130919210.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Iandola et al. [2016] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
  • Peters et al. [2019b] Bas Peters, Justin Granek, and Eldad Haber. Multiresolution neural networks for tracking seismic horizons from few training images. Interpretation, 7(3):SE201–SE213, 2019b. doi:10.1190/INT-2018-0225.1. URL https://doi.org/10.1190/INT-2018-0225.1.
  • Peters et al. [2019c] Bas Peters, Eldad Haber, and Justin Granek. Neural networks for geophysicists and their application to seismic data interpretation. The Leading Edge, 38(7):534–540, 2019c. doi:10.1190/tle38070534.1. URL https://doi.org/10.1190/tle38070534.1.
  • Witte et al. [2020] Philipp Witte, grizzuti, Mathias Louboutin, Ali Siahkoohi, and Felix Herrmann. slimgroup/InvertibleNetworks.jl: Jacobians and adjoint Jacobians of layers and networks, November 2020. URL https://doi.org/10.5281/zenodo.4298853.
  • Orozco et al. [2023] Rafael Orozco, Philipp Witte, Mathias Louboutin, Ali Siahkoohi, Gabrio Rizzuti, Bas Peters, and Felix J Herrmann. Invertiblenetworks. jl: A julia package for scalable normalizing flows. arXiv preprint arXiv:2312.13480, 2023.
  • Hasanlou and Seydi [2018] Mahdi Hasanlou and Seyd Teymoor Seydi. Hyperspectral change detection: an experimental comparative study. International Journal of Remote Sensing, 39(20):7029–7083, 2018. doi:10.1080/01431161.2018.1466079. URL https://doi.org/10.1080/01431161.2018.1466079.
  • Robson and Banta [1995] Stanley G. Robson and Edward R. Banta. Ground water atlas of the united states: Segment 2, arizona, colorado, new mexico, utah. Technical report, U.S. Geological Survey, 1995. URL http://pubs.er.usgs.gov/publication/ha730C.
  • Wu et al. [2019] Xinming Wu, Luming Liang, Yunzhi Shi, and Sergey Fomel. Faultseg3d: Using synthetic data sets to train an end-to-end convolutional neural network for 3d seismic fault segmentation. GEOPHYSICS, 84(3):IM35–IM45, 2019. doi:10.1190/geo2018-0646.1. URL https://doi.org/10.1190/geo2018-0646.1.
  • Zhao [2019] Tao Zhao. 3d convolutional neural networks for efficient fault detection and orientation estimation. In SEG Technical Program Expanded Abstracts 2019, pages 2418–2422, 2019. doi:10.1190/segam2019-3216307.1. URL https://library.seg.org/doi/abs/10.1190/segam2019-3216307.1.
  • Shi et al. [2018] Yunzhi Shi, Xinming Wu, and Sergey Fomel. Automatic salt-body classification using deep-convolutional neural network. In SEG Technical Program Expanded Abstracts 2018, pages 1971–1975, 2018. doi:10.1190/segam2018-2997304.1. URL https://library.seg.org/doi/abs/10.1190/segam2018-2997304.1.
  • Wang and Nealon [2019] Enning Wang and Jeff Nealon. Applying machine learning to 3d seismic image denoising and enhancement. Interpretation, 7(3):SE131–SE139, 2019. doi:10.1190/INT-2018-0224.1. URL https://doi.org/10.1190/INT-2018-0224.1.
  • Gao et al. [2021] Kai Gao, Lianjie Huang, Yingcai Zheng, Rongrong Lin, Hao Hu, and Trenton Cladohous. Automatic fault detection on seismic images using a multiscale attention convolutional neural network. Geophysics, 87(1):N13–N29, 11 2021. ISSN 0016-8033. doi:10.1190/geo2020-0945.1. URL https://doi.org/10.1190/geo2020-0945.1.
  • Dou et al. [2022] Yimin Dou, Kewen Li, Jianbing Zhu, Timing Li, Shaoquan Tan, and Zongchao Huang. Md loss: Efficient training of 3-d seismic fault segmentation network under sparse labels by weakening anomaly annotation. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2022. doi:10.1109/TGRS.2022.3196810.
  • Peters and Haber [2020] Bas Peters and Eldad Haber. Fully reversible neural networks for large-scale 3d seismic horizon tracking. In EAGE 2020 Annual Conference & Exhibition Online, volume 2020, pages 1–5. European Association of Geoscientists & Engineers, 2020.

Appendix A Additional experimental details

Here, we provide more details regarding the comparison network designs. Table 1 shows the memory requirements for the fully invertible hyperbolic network with block-low-rank (BLR) layers. Table 1 also shows the memory in case we do not use the invertibility of the network to compute gradients, and if we do not use BLR layers. Details for the comparison networks in the numerical experiments section are provided below.

A.1 Hyperspectral time-lapse

We compare the results of our proposed network to those of the symmetric ResNet [Haber and Ruthotto, 2017], whose layers are given by

$$\mathbf{Y}_{j} = \mathbf{Y}_{j-1} - h\,\mathbf{K}(\boldsymbol{\theta}_{j})^{\top}\sigma\big(\mathbf{K}(\boldsymbol{\theta}_{j})\mathbf{Y}_{j-1}\big). \tag{1.21}$$
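A single step of Eq. 1.21 can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: a small dense matrix stands in for the convolution $\mathbf{K}(\boldsymbol{\theta}_{j})$, ReLU is assumed for $\sigma$, and the function name `symmetric_resnet_step` is hypothetical.

```python
import numpy as np

def symmetric_resnet_step(Y, K, h):
    """One layer of Eq. 1.21: Y_j = Y_{j-1} - h * K^T sigma(K Y_{j-1})."""
    Z = K @ Y                       # K(theta_j) Y_{j-1}
    R = np.maximum(Z, 0.0)          # sigma = ReLU (an assumption for this sketch)
    return Y - h * (K.T @ R)        # the transpose makes the layer symmetric

rng = np.random.default_rng(0)
K = rng.standard_normal((8, 5))     # dense stand-in for a convolution operator
K /= np.linalg.norm(K, 2)           # scale to spectral norm 1
Y0 = rng.standard_normal((5, 3))    # state with 5 features and 3 samples
Y1 = symmetric_resnet_step(Y0, K, h=0.1)
```

The output has the same shape as the input. With ReLU and a step size $h \le 2/\|\mathbf{K}\|_2^2$, the step cannot increase the Frobenius norm of the state, which the example above satisfies by construction.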

The hyperspectral example requires a two-level network with one pooling operation. We employ the BLR layers, as in the fully invertible hyperbolic network. However, because the ResNet relies on automatic differentiation and not on invertibility for gradient computations, the same number of network layers does not fit on the GPU. Therefore, we have to shorten the network. Figure 6 compares three ResNets, and Table 5 describes the design details.

Comparison ResNets for the hyperspectral example

| Network | # layers level 1 | # layers level 2 | Result |
|---|---|---|---|
| 1 | 3 | 4 | Fig. 6 |
| 2 | 2 | 5 | Fig. 6 |
| 3 | 5 | 2 | Fig. 6 |
| 4 | 6 | 1 | out-of-memory |
| 5 | 3 | 5 | out-of-memory |
| 6 | 5 | 3 | out-of-memory |

Table 5: Designs of the comparison ResNets according to Eq. 1.21. These are all two-level networks, i.e., a few ResNet blocks followed by a pooling operation and some more ResNet blocks.

A.2 3D Seismic Interpretation

Memory-wise, it is impossible to train the seismic segmentation example on most GPUs without invertibility and BLR kernels, because the convolutional kernels alone then require 41.16 GB and the states alone require 21.96 GB. These numbers do not include memory for intermediate computations, the gradient itself, and other storage for the optimizer. For comparison, we construct the largest possible networks that fit on a 24 GB GPU and are as similar as possible to the network in Table 3. We modify the design by removing one level and shortening the network. Network a uses BLR layers, while network b does not. We refer back to Table 4 for the performance metrics, which show that the networks with fewer layers and one level fewer significantly underperform the proposed network. See Figure 13 for predictions from the shorter networks.

Table 6: Network designs that just fit on a 24 GB GPU, to compare to our fully invertible network for the 3D seismic interpretation example.

| Network | Layer | Block rank | Channels | Feature size |
|---|---|---|---|---|
| a | 1-2 | 8 | 12 | 248 × 248 × 248 |
| a | 3-5 | 16 | 96 | 124 × 124 × 124 |
| a | 6-17 | 32 | 768 | 62 × 62 × 62 |
| a | 18-20 | 16 | 96 | 124 × 124 × 124 |
| a | 21-26 | 8 | 12 | 248 × 248 × 248 |
| b | 1-2 | 12 | 12 | 248 × 248 × 248 |
| b | 3-4 | 96 | 96 | 124 × 124 × 124 |
| b | 5-6 | 768 | 768 | 62 × 62 × 62 |
| b | 7-8 | 96 | 96 | 124 × 124 × 124 |
| b | 9-14 | 12 | 12 | 248 × 248 × 248 |
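The trade-off between state memory and kernel memory can be illustrated with a rough estimate over the channel counts and feature sizes in Table 6. The sketch below rests on assumptions that are not stated in this appendix: single-precision (4-byte) values, dense 3 × 3 × 3 kernels mapping all input channels to all output channels, and one stored state per layer. The helper names are hypothetical, and the estimate ignores BLR compression, gradients, and optimizer storage.

```python
def state_bytes(channels, feature_size, bytes_per_value=4):
    """Memory for one network state: channels times the number of grid points."""
    n = channels
    for s in feature_size:
        n *= s
    return n * bytes_per_value

def dense_kernel_bytes(channels, kernel_size=3, ndim=3, bytes_per_value=4):
    """Memory for one dense (non-BLR) layer: channels^2 kernels of kernel_size^ndim."""
    return channels * channels * kernel_size ** ndim * bytes_per_value

# (channels, feature size) for the three resolution levels listed in Table 6
levels = [(12, (248, 248, 248)), (96, (124, 124, 124)), (768, (62, 62, 62))]

per_state = [state_bytes(c, f) for c, f in levels]
kernels = [dense_kernel_bytes(c) for c, _ in levels]

for (c, f), s, k in zip(levels, per_state, kernels):
    print(f"channels={c:4d} size={f}: state {s / 1e9:.3f} GB, dense kernels {k / 1e6:.3f} MB")
```

Each coarsening step halves every spatial dimension and multiplies the channel count by eight, so the state size is the same at every level, whereas the dense kernel memory per layer grows by a factor of 64 per level under these assumptions. This is the growth that the compressed BLR layers are meant to limit.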
Figure 13: Predictions using the comparison networks from Table 6, network a (top) and network b (bottom). Figures show three orthogonal cross-sections from the final prediction and the true labels. Training label locations are shaded. Both results look mostly realistic, but contain more holes and incorrectly assigned patches compared to Figure 11.