Fully invertible hyperbolic neural networks for segmenting large-scale surface and sub-surface data

Bas Peters
Computational Geosciences Inc.
Vancouver, BC
[email protected]
Eldad Haber
Department of Earth, Ocean, and Atmospheric Sciences
The University of British Columbia
Vancouver, BC
Keegan Lensink
Department of Earth, Ocean, and Atmospheric Sciences
The University of British Columbia
Vancouver, BC

Abstract

The large spatial/temporal/frequency scale of geoscience and remote-sensing datasets causes memory issues when using convolutional neural networks for (sub-) surface data segmentation. Recently developed fully reversible or fully invertible networks can mostly avoid memory limitations by recomputing the states during the backward pass through the network. This results in a low and fixed memory requirement for storing network states, as opposed to the typical linear memory growth with network depth. This work focuses on a fully invertible network based on the telegraph equation. While reversibility saves the major amount of memory used in deep networks by the data, the convolutional kernels can take up most memory if fully invertible networks contain multiple invertible pooling/coarsening layers. We address the explosion of the number of convolutional kernels by combining fully invertible networks with layers that contain the convolutional kernels in a compressed form directly. A second challenge is that invertible networks output a tensor the same size as its input. This property prevents the straightforward application of invertible networks to applications that map between different input-output dimensions, need to map to outputs with more channels than present in the input data, or desire outputs that decrease/increase the resolution compared to the input data. However, we show that by employing invertible networks in a non-standard fashion, we can still use them for these tasks. Examples in hyperspectral land-use classification, airborne geophysical surveying, and seismic imaging illustrate that we can input large data volumes in one chunk and do not need to work on small patches, use dimensionality reduction, or employ methods that classify a patch to a single central pixel.

Keywords Invertible Neural Networks $\cdot$ Large Scale Deep Learning $\cdot$ Memory Efficient Deep Learning

1 Introduction

Many datasets in the imaging sciences are intrinsically 3D or 4D. Tasks involving such data include the interpretation of seismic imagery, hyperspectral data segmentation for land-use classification, and segmentation of various medical imagery. To construct convolutional neural networks with a sufficiently large field-of-view (receptive field), we need deeper networks with more layers or multiple coarsening (pooling) and refinement stages in the network. Reasons we wish to work on large chunks of data instead of many small patches include wanting to learn from larger length scales that may be present in the data, as well as weakly supervised approaches that add prior knowledge or constraints related to properties of full images (Kervadec et al., 2019; Peters, 2022; Jia et al., 2021).

The dominant factor that limits the input data size and network depth is the storage of the network state, that is, the convolved data at each layer that is needed in order to compute a gradient of the loss function using back-propagation, implemented via reverse-mode automatic differentiation. Re-computing the network (forward) states in reverse order during back-propagation avoids this problem. This re-computation is possible when using fully invertible (also known as reversible) networks. Fully invertible networks have a constant memory requirement for states (activations) that is independent of network depth and the number of pooling stages, see Figure 2. Therefore, fully invertible networks largely avoid the memory limitations related to storing states.

Specifically, we employ a second-order hyperbolic differential equation based invertible network (Lensink et al., 2022). The connection with the wave equation enables the use of a suite of tools from analysis to interpretations that are commonly used in mathematical physics and numerical analysis. We note that other invertible network constructions exist, including invertible Hamiltonians (Ruthotto and Haber, 2018), invertible ResNets (Behrmann et al., 2019), invertible u-nets (Etmann et al., 2020). The fully invertible networks generally extend invertible networks for image classification (Jacobsen et al., 2018; van de Leemput et al., 2019) and networks that are only invertible in between coarsening/pooling stages (Ruthotto and Haber, 2018; Chang et al., 2018; Gomez et al., 2017; Dinh et al., 2016).

Fully invertible networks use invertible pooling/coarsening operations. Examples are the Haar transform (Lensink et al., 2022), reordering via a checkerboard pattern (Dinh et al., 2016; Jacobsen et al., 2018), or various learned coarsening operators (Lensink et al., 2022; Etmann et al., 2020). The invertible pooling causes the fully hyperbolic invertible network to be less flexible than some other networks for image segmentation in two ways. First, the convolutional kernels become the dominant memory consumer when the network contains several down-sampling operators. A fully invertible hyperbolic network needs to increase the number of channels by a factor of eight to change the resolution by a factor of two in each direction in 3D. This preservation of the number of elements in the tensors makes the coarsening and channel-count changes invertible operations. However, this approach leads to an ‘explosion’ of the channels, as remarked by Peters et al. (2019a); Etmann et al. (2020). For instance, if the input is three-channel RGB, there are $192$ channels after two coarsening layers and an astonishing $98304$ channels in case we wish to coarsen five times. The storage and computations of the associated $98304^{2}$ convolutional kernels (just for one layer at the coarsest level) would be completely unfeasible. Figure 2 illustrates this effect.

A second way in which the fully invertible hyperbolic network is a relatively ‘rigid’ design is that an orthogonal transform can increase and later decrease the number of channels by $8\times$ per coarsening layer only, but we cannot arbitrarily reduce the number of channels and, thus network parameters. Similarly, one cannot increase the number of channels or make the network wider without decreasing resolution. This will often cause the network to contain too many or too few convolutional kernels for a certain task, given the network depth and the number of coarsening/refinement stages. The standard design of an invertible network outputs a tensor of the same size as its input and with the same number of channels. Therefore, one cannot directly apply invertible networks to applications like hyperspectral imaging that map 3D/4D inputs to a 2D output. Other applications that cannot directly work with invertible networks include ones that need to map to outputs with more channels than present in the input data or desire outputs that decrease/increase the resolution compared to the input data.

In this work, we present solutions to the above problems by combining the design of fully invertible networks with layers that can reduce the storage and computations related to the convolutional kernels. The same layer also serves as a way to increase the number of convolutional kernels per layer if required, without changing resolution. These two features make fully invertible hyperbolic networks much more flexible and fix the primary disadvantages. Furthermore, we show that we can, in fact, use invertible neural networks to change the resolution or the number of output channels while maintaining network invertibility.

Several examples illustrate how the presented tools enable the application of invertible neural networks to the following geoscientific problems: 1) time-lapse hyperspectral land-use-change detection, which maps 4D data to a 2D map; 2) large-scale 2D multi-model airborne-geophysical and remote sensing for aquifer mapping, where we map from dozens of input channels to a couple of output classes; 3) geological model building from seismic data, where we set up the network so that it outputs a lower resolution compared to the input.

The examples illustrate that the developed tools extend the type of problems that can be handled using fully hyperbolic architectures, enabling training using larger data blocks as input for the network. This, in turn, allows us to work with higher-resolution inputs and learn larger-scale (spatial/temporal/harmonic) patterns that are present in the data.

1.1 Contributions

This work looks at some practical obstacles when applying fully invertible neural networks based on hyperbolic PDEs to large-scale remote sensing and geoscience problems. Specifically, we note our primary contributions as:

• To the best of our knowledge, this is the first work that addresses the issue of the ‘exploding’ memory for convolutional kernels with an increasing number of resolution/pooling changes in a hyperbolic invertible network. Our solution keeps the network fully invertible while drastically reducing the number of convolutional kernels and associated memory and thus enables learning from data on a much larger scale than before while also working with arbitrarily deep networks.

• We present a few subtle modifications that remove the limitation that fully invertible hyperbolic networks map between inputs and outputs of the same size/resolution and channel count while not giving up full invertibility and without increasing the computational cost or memory requirements.

After reviewing the design and some properties of the invertible hyperbolic neural network, we illustrate the limitations of the network structure. Then, we propose our solutions. Finally, we train on hyperspectral, multi-modality, and seismic datasets with sparse spatial label sampling.

2 Fully invertible hyperbolic neural networks for large input-output problems

The overwhelming majority of the literature relies on reverse-mode automatic differentiation for gradient computation. This type of gradient computation requires access to the network states $\mathbf{Y}_{j}$ at layer $j$ during the backpropagation phase. Standard implementations keep all network states in memory, causing the memory footprint to grow linearly with network depth. Workarounds to reduce memory may rely on shallow networks or a network design that maps a small patch or data sub-volume into the class of the central pixel/voxel.

Fully invertible networks based on PDEs require memory for just a couple of layer states, depending on the discretization. So, there is no longer a need to trade off network depth for input size. Memory savings by using invertible architectures allow us to allocate all available memory towards larger data input volumes, enabling the network to learn from large-scale structures in the data.

As discussed in the introduction, out of the various fully invertible network designs, our focus is on the physics-inspired invertible network based on the non-linear Telegraph equation (Zhou and Luo, 2018) with time-step $h$ . This equation is the basis for the invertible architecture of Ruthotto and Haber (2018); Chang et al. (2018).

$$\frac{\partial^{2}{\bf Y}}{\partial t^{2}}=f({\bf Y},{\boldsymbol{\theta}}(t)), \tag{2.1}$$

where ${\bf Y}$ is the state, $f$ is a non-linear function, and the model parameters ${\boldsymbol{\theta}}(t)$ are time dependent. The model parameters often parameterize convolutional, block-convolutional, differential and dense matrices, denoted as ${\bf K}({\boldsymbol{\theta}}(t))$. While many neural networks use nonlinearities of the type $f({\bf Y}(t),{\bf K}({\boldsymbol{\theta}}(t)))=\sigma({\bf Y}(t),{\bf K}({\boldsymbol{\theta}}(t)))$ with a nonlinear, point-wise, and monotonically increasing activation function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$, we select the symmetric layer (Ruthotto and Haber, 2018) of the form

$$f({\bf Y},{\boldsymbol{\theta}}(t))=-{\bf K}({\boldsymbol{\theta}}(t))^{\top}\sigma({\bf K}({\boldsymbol{\theta}}(t)){\bf Y}(t)). \tag{2.2}$$

With this choice, (2.1) becomes

$$\frac{\partial^{2}{\bf Y}}{\partial t^{2}}=-{\bf K}({\boldsymbol{\theta}}(t))^{\top}\sigma({\bf K}({\boldsymbol{\theta}}(t)){\bf Y}(t)). \tag{2.3}$$

The motivation for the specific symmetric layer choice relates to stability and energy conservation of the forward propagation through the network; see Ruthotto and Haber (2018) for a stability proof.

To proceed, we follow Ruthotto and Haber (2018) and use the conservative Leapfrog discretization of the second derivative of the state,

$$\frac{\partial^{2}{\bf Y}}{\partial t^{2}}\approx\frac{1}{h^{2}}\left({\bf Y}_{j+1}-2{\bf Y}_{j}+{\bf Y}_{j-1}\right), \tag{2.4}$$

where $h$ now indicates the artificial-time step and $j$ indexes the discrete time. Combining the discretization of the second derivative and (2.3) results in

$$\begin{aligned}\mathbf{Y}_{1}&=\mathbf{X},\quad\mathbf{Y}_{2}=\mathbf{X}\\ \mathbf{Y}_{j}&=2\mathbf{Y}_{j-1}-\mathbf{Y}_{j-2}-h^{2}\mathbf{K}(\boldsymbol{\theta}_{j})^{\top}\sigma(\mathbf{K}(\boldsymbol{\theta}_{j})\mathbf{Y}_{j-1})\end{aligned} \tag{2.5}$$

Equation (2.5) is a hyperbolic network that uses a single resolution. The first two states are the initial conditions, which we set equal to the input data $\mathbf{X}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}\times n_{\text{chan}}}$. The data tensor has $n_{\text{chan}}$ channels and three other dimensions of sizes $n_{1}$, $n_{2}$, and $n_{3}$. Examples include hyperspectral data and 3D seismic image volumes.

An artificial time-step $h$ affects the stability of the forward propagation (Haber and Ruthotto, 2017) via the well-known CFL condition (LeVeque, 1990). In this work, we set the linear operator $\mathbf{K}(\boldsymbol{\theta}_{j})$ to convolutions with kernels $\boldsymbol{\theta}_{j}$ , and select the ReLU as the activation function $\sigma(\cdot)$ for the examples.

In order to introduce multi-resolution into the system we use the approach by Lensink et al. (2022) and introduce the linear operators $\mathbf{W}_{j}$ that change the resolution of the state without losing information by moving the information from the spatial dimension to the channel dimension, obtaining the network

$$\begin{aligned}\mathbf{Y}_{1}&=\mathbf{X},\quad\mathbf{Y}_{2}=\mathbf{X}\\ \mathbf{Y}_{j}&=2\mathbf{W}_{j-1}\mathbf{Y}_{j-1}-\mathbf{W}_{j-2}\mathbf{Y}_{j-2}-h^{2}\mathbf{K}(\boldsymbol{\theta}_{j})^{\top}\sigma(\mathbf{K}(\boldsymbol{\theta}_{j})\mathbf{W}_{j-1}\mathbf{Y}_{j-1}),\quad j=3,\cdots,n.\end{aligned} \tag{2.6}$$

The operator $\mathbf{W}$ represents coarsening/pooling to change the resolution and the number of channels simultaneously. Note that $\mathbf{W}_{j}$ equals the identity if there is no resolution change at layer $j$ . Figure 1 illustrates an instance of the network design.

The operators $\mathbf{W}_{j}$ are important components of networks for image-to-image mappings, and they need to be invertible operators to enable full invertibility of the network. Practical invertible linear operators are orthogonal, do not require storage of their dense matrix representation, and have fast forward and inverse transforms known in closed form. Here, we select the orthogonal Haar wavelet transform (Truchetet and Laligant, 2004) $\mathbf{W}$ to coarsen the image and increase the number of channels simultaneously. This choice was used successfully in Lensink et al. (2022). The transpose (and inverse) achieves the reverse of these operations, $\mathbf{W}^{\top}=\mathbf{W}^{-1}$ and $\mathbf{W}^{\top}\mathbf{W}=I=\mathbf{W}\mathbf{W}^{\top}$, so the action of the linear operator $\mathbf{W}$ on a tensor $\mathbf{Y}$ creates the mappings

$$\mathbf{W}\mathbf{Y}:\mathbb{R}^{n_{1}\times n_{2}\times n_{3}\times n_{\text{chan}}}\rightarrow\mathbb{R}^{n_{1}/2\times n_{2}/2\times n_{3}/2\times 8n_{\text{chan}}}, \tag{2.7}$$
$$\mathbf{W}^{-1}\mathbf{Y}:\mathbb{R}^{n_{1}\times n_{2}\times n_{3}\times n_{\text{chan}}}\rightarrow\mathbb{R}^{2n_{1}\times 2n_{2}\times 2n_{3}\times n_{\text{chan}}/8}. \tag{2.8}$$

Because of the invertibility of any orthogonal transform, applying $\mathbf{W}$ and $\mathbf{W}^{\top}$ incurs no loss of information. As mentioned in the introduction, other invertible pooling or coarsening/refining operators are also available.
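To make the mappings (2.7) and (2.8) concrete, the following NumPy sketch implements one level of an orthogonal 3D Haar transform that moves each $2\times 2\times 2$ spatial block into the channel dimension. It is a minimal illustration under our own assumptions, not the implementation used for the experiments: the function names `haar_coarsen` and `haar_refine` and the ordering of the resulting channels are hypothetical choices.

```python
import numpy as np

# 2x2 orthogonal Haar matrix: row 0 averages, row 1 takes differences.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
# 8x8 orthogonal matrix acting on a flattened 2x2x2 block (separable 3D Haar).
H8 = np.kron(np.kron(H2, H2), H2)


def haar_coarsen(Y):
    """W: (n1, n2, n3, c) -> (n1/2, n2/2, n3/2, 8c); n1, n2, n3 must be even."""
    n1, n2, n3, c = Y.shape
    # Split every spatial axis into (coarse index, position inside the block).
    B = Y.reshape(n1 // 2, 2, n2 // 2, 2, n3 // 2, 2, c)
    B = B.transpose(0, 2, 4, 1, 3, 5, 6).reshape(n1 // 2, n2 // 2, n3 // 2, 8, c)
    # Mix the 8 entries of each block with the orthogonal Haar matrix.
    Z = np.einsum("ab,xyzbc->xyzac", H8, B)
    return Z.reshape(n1 // 2, n2 // 2, n3 // 2, 8 * c)


def haar_refine(Z):
    """W^T = W^{-1}: (m1, m2, m3, 8c) -> (2 m1, 2 m2, 2 m3, c)."""
    m1, m2, m3, c8 = Z.shape
    c = c8 // 8
    B = Z.reshape(m1, m2, m3, 8, c)
    B = np.einsum("ba,xyzbc->xyzac", H8, B)  # apply the transpose of H8
    # Put the block entries back at their spatial positions.
    B = B.reshape(m1, m2, m3, 2, 2, 2, c).transpose(0, 3, 1, 4, 2, 5, 6)
    return B.reshape(2 * m1, 2 * m2, 2 * m3, c)


rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 6, 8, 3))
Z = haar_coarsen(Y)       # shape (2, 3, 4, 24)
Y_back = haar_refine(Z)   # shape (4, 6, 8, 3)
```

Because `H8` is orthogonal, `Z` has the same Euclidean norm as `Y`, and `haar_refine` recovers `Y` up to floating-point rounding.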

Figure 1: Diagram of the flow of the multi-level hyperbolic network in (2.6), shown for 3D input data with one channel and two levels. The network contains seven layers, with pooling after layer four and unpooling after layer six.

Invertibility of the full network is exploited by isolating one of the states in (2.6) and reversing the indices to obtain an expression for the current state in terms of future states:

$$\mathbf{Y}_{j}=\mathbf{W}_{j}^{-1}\bigg[2\mathbf{W}_{j+1}\mathbf{Y}_{j+1}-h^{2}\mathbf{K}(\boldsymbol{\theta}_{j+2})^{\top}\sigma(\mathbf{K}(\boldsymbol{\theta}_{j+2})\mathbf{W}_{j+1}\mathbf{Y}_{j+1})-\mathbf{Y}_{j+2}\bigg]. \tag{2.9}$$

This equation does not require inverting the activation function $\sigma$ . Instead, only the inversion of the orthogonal wavelet transform is required, which is known in closed form. When computing the gradient of the loss function using backpropagation, we recompute the states $\mathbf{Y}_{j}$ while propagating backwards through the network. The re-computation avoids the storage of all $\mathbf{Y}_{j}$ and leads to a fixed memory requirement for the states of three layers, see Table 1 for an overview.
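The forward recursion and its reversal can be sketched in a few lines of NumPy. The example below is an illustration under simplifying assumptions: it uses a single resolution ($\mathbf{W}_{j}=I$, so the forward pass is (2.5)), and a hypothetical dense random matrix stands in for each convolution operator $\mathbf{K}(\boldsymbol{\theta}_{j})$. The function names and the chosen sizes are ours.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def layer(K, Y):
    """Symmetric layer K^T sigma(K Y) from (2.2), without the minus sign."""
    return K.T @ relu(K @ Y)


def forward(X, Ks, h):
    """Leapfrog recursion (2.5); only the last two states are kept."""
    Y_prev, Y_curr = X, X                      # Y_1 = Y_2 = X
    for K in Ks:                               # one K per layer j = 3, ..., n
        Y_next = 2.0 * Y_curr - Y_prev - h**2 * layer(K, Y_curr)
        Y_prev, Y_curr = Y_curr, Y_next
    return Y_prev, Y_curr


def reverse(Y_prev, Y_curr, Ks, h):
    """Recompute earlier states from the last two, using (2.9) with W = I."""
    for K in reversed(Ks):
        Y_older = 2.0 * Y_prev - h**2 * layer(K, Y_prev) - Y_curr
        Y_prev, Y_curr = Y_older, Y_prev
    return Y_prev, Y_curr


rng = np.random.default_rng(0)
X = rng.standard_normal((16, 4))               # 16 features, 4 samples
Ks = [0.3 * rng.standard_normal((16, 16)) for _ in range(20)]
h = 0.1

Y_prev, Y_last = forward(X, Ks, h)
X1, X2 = reverse(Y_prev, Y_last, Ks, h)        # should reproduce Y_1 and Y_2
rel_err = np.linalg.norm(X1 - X) / np.linalg.norm(X)
```

Note that `reverse` never inverts the ReLU; it only re-evaluates the layer at a state that is already known, which is why the recovered input matches `X` up to rounding error.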

Table 1: Memory requirements for the states $\mathbf{Y}_{j}$ and convolutional kernels $\boldsymbol{\theta}$ for fully invertible and non-invertible equivalent networks based on the networks in Table 2.

Dataset	Network	States	Conv. kernels
Hyperspectral	non-invertible	$22.5$ GB	$0.02$ GB
Hyperspectral	invertible	$3.7$ GB	$0.02$ GB
Hyperspectral	invertible + BLR layers	$3.7$ GB	$0.003$ GB
Aquifer mapping	non-invertible	$21.02$ GB	$32.19$ GB
Aquifer mapping	invertible	$1.66$ GB	$32.19$ GB
Aquifer mapping	invertible + BLR layers	$1.66$ GB	$0.12$ GB
3D seismic	non-invertible	$21.96$ GB	$41.16$ GB
3D seismic	invertible	$2.19$ GB	$41.16$ GB
3D seismic	invertible + BLR layers	$2.19$ GB	$0.23$ GB

The benefits of the hyperbolic network design range beyond just invertibility and memory savings. We point out that the network design described in this section can include instance/batch/layer normalization. However, we found it unnecessary for our examples. The following stability property from (Ruthotto and Haber, 2018, Thm. 2) can partially explain this observation.

Theorem 2.1 (Stability of the fully invertible hyperbolic network (2.3) and (2.6)).

The neural network satisfies the stability criterion

$$\|{\bf y}_{1}(t=T)-{\bf y}_{2}(t=T)\|_{2}^{2}\leq c\|{\bf y}_{1}(t=0)-{\bf y}_{2}(t=0)\|_{2}^{2} \tag{2.10}$$

where ${\bf y}_{1}$ and ${\bf y}_{2}$ are two different initial states, given at times equal to zero and after propagating to time $T$ . The constant $c>0$ is independent of $t$ .

The above theorem is stated in (Ruthotto and Haber, 2018, Thm. 2) for time-independent weights $\mathbf{K}$ . Without stating the full proof, we illuminate the structure of the hyperbolic network, stability, and connection to other physical wave-equations. To start, rewrite (2.3) into a first-order system of equations by introducing the state ${\bf v}(t)$ and using ${\bf K}{\bf u}(t)=\partial_{t}{\bf v}(t)$ and $-{\bf K}^{\top}{\bf v}(t)=\partial_{t}{\bf u}(t)$ to arrive at

$$\begin{aligned}\partial_{t}\begin{bmatrix}{\bf u}(t)\\ {\bf v}(t)\end{bmatrix}&=\begin{bmatrix}I&0\\ 0&\sigma(\cdot)\end{bmatrix}\begin{bmatrix}0&-{\bf K}^{\top}\\ {\bf K}&0\end{bmatrix}\begin{bmatrix}{\bf u}(t)\\ {\bf v}(t)\end{bmatrix}\\ &\leftrightarrow\partial_{t}{\bf x}(t)={\bf B}\circ{\bf A}{\bf x}(t).\end{aligned} \tag{2.11}$$

The above immediately shows that the initial choice of network non-linearity $-{\bf K}(t)^{\top}\sigma({\bf K}(t){\bf Y}(t))$ leads to the anti-hermitian operator, or skew-symmetric matrix property ${\bf A}^{\top}=-{\bf A}$. This property, in combination with (2.11), leads to a standard stability argument via the upper bound of the energy

$$\|{\bf B}\circ{\bf A}{\bf x}(t)\|_{2}^{2}\leq\|{\bf A}{\bf x}(t)\|_{2}^{2}\leq\|{\bf A}\|_{2}^{2}\|{\bf x}(t)\|_{2}^{2}=c\|{\bf x}(t)\|_{2}^{2}, \tag{2.12}$$

where the nonlinearity satisfies $|\sigma({\bf x})|\leq|{\bf x}|$ and $c$ is the constant $\|{\bf A}\|_{2}^{2}$ . This energy definition is equivalent to the one in (Evans, 2010, sect. 2.4.3) and Ruthotto and Haber (2018). The upper bound is the energy associated with the linear wave-like equation. The energy of a linear wave-like equation is constant in time, as shown by a standard argument using the anti-hermitian ${\bf A}$ as follows:

$$\begin{aligned}\frac{\partial\|{\bf x}(t)\|_{2}^{2}}{\partial t}&=\frac{\partial}{\partial t}({\bf x}(t),{\bf x}(t))=(\partial_{t}{\bf x}(t),{\bf x}(t))+({\bf x}(t),\partial_{t}{\bf x}(t))\\ &=({\bf A}{\bf x}(t),{\bf x}(t))+({\bf x}(t),{\bf A}{\bf x}(t))\\ &=({\bf A}{\bf x}(t),{\bf x}(t))-({\bf A}{\bf x}(t),{\bf x}(t))=0,\end{aligned} \tag{2.13}$$

where the norm is induced by the inner product $\|{\bf x}\|_{2}^{2}=({\bf x},{\bf x})$ .

Thus, the energy of the hyperbolic network, in between resolution changes, is bounded from above by the energy of the linear wave-like equation, which itself is bounded. The orthogonal wavelet transforms also do not change the energy across resolution levels.
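The two ingredients of this argument are easy to check numerically. The sketch below is an illustration only: a hypothetical dense random matrix stands in for the convolution operator ${\bf K}$, and the block operator ${\bf A}$ from (2.11) is assembled explicitly. It verifies that ${\bf A}$ is skew-symmetric, that the energy rate $2({\bf A}{\bf x},{\bf x})$ of the linear system vanishes, and that the ReLU satisfies $|\sigma({\bf x})|\leq|{\bf x}|$.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((4, 6))             # hypothetical dense stand-in for K

# Block operator A from (2.11): u has 6 entries, v has 4 entries.
A = np.block([[np.zeros((6, 6)), -K.T],
              [K, np.zeros((4, 4))]])
x = rng.standard_normal(10)                 # stacked state [u; v]

skew_residual = np.abs(A.T + A).max()       # zero when A^T = -A
energy_rate = 2.0 * x @ (A @ x)             # d/dt ||x||^2 for dx/dt = A x
relu_is_bounded = bool(np.all(np.abs(np.maximum(x, 0.0)) <= np.abs(x)))
```

Here `skew_residual` is exactly zero by construction, and `energy_rate` vanishes up to floating-point rounding for any choice of `K` and `x`.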

Below, we will verify the stability property for randomly initialized and trained networks.

2.1 Empirical verification of stability properties

The inherited stability of the invertible hyperbolic network as stated above suggests it is possible to train the networks for the examples without using any instance/batch/layer normalization. Here, we illustrate the stability, as well as invertibility and energy growth, for both randomly initialized and trained networks. Figure 3 displays empirically observed network properties for an untrained version of the ( $30$ layer, four-level) network described in Table 3, for a range of artificial time-step sizes $h$ . The function $g({\bf X},{\boldsymbol{\theta}})$ denotes the propagation of input ${\bf X}$ through the network, and $g^{-1}$ denotes the inverse/reverse propagation. The figures show the following properties:

• Energy growth: $\frac{\|g({\bf X}_{0},{\boldsymbol{\theta}})\|_{2}}{\|{\bf X}_{0}\|_{2}}$, which is the ratio of the norm of the network output to the norm of the network input.

• Stability: $\frac{\|g({\bf X},{\boldsymbol{\theta}})-g({\bf X}+\delta{\bf X},{\boldsymbol{\theta}})\|_{2}}{\|\delta{\bf X}\|_{2}}$, where $\delta{\bf X}$ is a random perturbation (selected with a magnitude of $0.1\|{\bf X}\|_{2}$ for the current experiment).

• Invertibility error: $\frac{\|g^{-1}(g({\bf X},{\boldsymbol{\theta}}),{\boldsymbol{\theta}})-{\bf X}\|_{2}}{\|{\bf X}\|_{2}}$.

The results show that a network initialized with zero mean and normally distributed weights is stable, preserves energy relatively well, and is invertible in practice. For energy growth and stability, a U-net with the same number of levels, layers per level, and the same number of channels resulted in exploding outputs. We note that the results depend on the scaling of the random initialization and that a smart initialization like Glorot and Bengio (2010) can also improve the stability at the start of training. It can be verified that our network shows similarly good results for a wide range of scalings of random initial weights.

The same test after training the network revealed the following numbers for each property: energy growth: $7.42$, which is a reasonably small number (note that energy growth is not expected to be smaller than one, see (2.12)); stability: mean of $1.05$ with std $0.004$; invertibility error: $0.00018$. These numbers illustrate that the favorable stability, energy growth, and invertibility properties are preserved while training the network.
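The three diagnostics of this section are straightforward to compute for any pair of forward and inverse maps. The sketch below is a generic illustration, not the code behind the reported numbers: the function names are ours, and a random orthogonal matrix serves as a toy invertible "network" for which all three quantities are known in advance (energy growth and stability equal to one, invertibility error near machine precision).

```python
import numpy as np


def energy_growth(g, X):
    """||g(X)|| / ||X||."""
    return np.linalg.norm(g(X)) / np.linalg.norm(X)


def stability(g, X, dX):
    """||g(X) - g(X + dX)|| / ||dX||."""
    return np.linalg.norm(g(X) - g(X + dX)) / np.linalg.norm(dX)


def invertibility_error(g, g_inv, X):
    """||g_inv(g(X)) - X|| / ||X||."""
    return np.linalg.norm(g_inv(g(X)) - X) / np.linalg.norm(X)


rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # toy orthogonal "network"


def g(X):
    return Q @ X


def g_inv(Z):
    return Q.T @ Z


X = rng.standard_normal((8, 3))
dX = rng.standard_normal(X.shape)
dX *= 0.1 * np.linalg.norm(X) / np.linalg.norm(dX)  # ||dX|| = 0.1 ||X||

growth = energy_growth(g, X)
stab = stability(g, X, dX)
inv_err = invertibility_error(g, g_inv, X)
```

To evaluate a real network, `g` and `g_inv` would be replaced by its forward and reverse propagation with fixed weights.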

3 Problems and solutions for practical invertible networks

In the following, we propose solutions to the limitations of fully invertible multi-level hyperbolic networks as posed in the introduction: an exploding number of convolutional kernels, inability to deal with different input-output resolutions and channel numbers, and transformations between different data-label dimensions. Below, we address each of these problems and present solutions such that we can apply hyperbolic invertible networks anyway, without fundamentally altering the network design or giving up invertibility.

3.1 Exponential parameter growth with the number of resolution changes - a block low-rank approach

Orthogonal transforms that make neural networks fully invertible also preserve the number of entries after pooling/coarsening and unpooling/refining the state. Indeed, equations (2.7) and (2.8) show that if there are $n_{\text{chan}}^{2}$ convolutional kernels per network layer, starting with 3D input data with $n_{\text{chan (ini)}}$ channels leads to

$$n_{\text{chan}}=n_{\text{chan (ini)}}\times 8^{n_{\text{coarsening}}} \tag{3.14}$$

channels after coarsening/pooling $n_{\text{coarsening}}$ times. Equivalently, each time the network states coarsen, the number of convolutional kernels grows by a factor of $64$. See Figure 2 for an illustration of this effect for various network designs as a function of input size, number of layers, and the number of coarsening steps. As noted by Peters et al. (2019a); Etmann et al. (2020), this results in prohibitive memory and computation times for storing and computing convolutional kernels, respectively.
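The growth described by (3.14) can be tabulated with a few lines of Python. The helper names below are hypothetical; the sketch assumes a square block matrix $\mathbf{K}$ with $n_{\text{chan}}^{2}$ kernels per layer and the three-channel input used as an example in the introduction.

```python
def channels_after(n_chan_ini, n_coarsening, dim=3):
    """Channel count (3.14): each coarsening multiplies channels by 2**dim."""
    return n_chan_ini * (2 ** dim) ** n_coarsening


def dense_kernel_count(n_chan):
    """Number of kernels in one layer with a square block matrix K."""
    return n_chan ** 2


for level in range(6):
    c = channels_after(3, level)
    print(f"coarsenings={level}  channels={c}  kernels/layer={dense_kernel_count(c)}")
```

For a three-channel input this gives $192$ channels after two coarsening steps and $98304$ after five, and the kernel count per layer grows by a factor of $64$ from one level to the next.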

From the above observations and Figure 2, it is clear that invertible hyperbolic networks with multiple coarsening stages can only be a practical tool if it is possible to reduce the memory requirements for the network parameters. Various memory reduction techniques for network weights have been developed, and below we discuss a few that are/are not suitable for our purposes. We also introduce a novel network layer defined by a low block rank.

To start, network compression methods such as pruning and low-rank matrix/tensor factorization (Denton et al., 2014) train a ‘full’ network first and then compress the weights. These ideas do not apply to our situation because we assume that it is infeasible to train the full network. Instead, we need methods that directly train a reduced-memory network. Prior work includes replacing a part of the convolutional kernels by scalars (Ephrath et al., 2019), equipping the convolutional kernels with a block-circulant structure for parameter reduction (Ding et al., 2017; Treister et al., 2018), and employing Kronecker-product structures (Wu, 2016).

Here, we introduce another layer type that requires minimal adaptation to existing code, has easily adaptable memory requirements, and has a clear linear-algebraic interpretation. Inspired by techniques like LR-factorization for matrix completion (Rennie and Srebro, 2005; Aravkin et al., 2014), we construct layers with a block-low-rank structure formed by the convolutional kernels when arranged in their flattened matrix form. This structure allows for explicit limitations on the number of convolutions in a layer while preserving a possibly extremely large number of channels. Our approach is similar to ‘squeeze-and-expand’ bottleneck methods like Szegedy et al. (2015); He et al. (2016); Iandola et al. (2016) proposed for non-invertible residual networks, mainly in the context of image classification. Our layer differs because we have one instead of two nonlinearities per layer, and our convolutional kernels have a symmetric structure (see Eq. (2.2)), which SqueezeNet does not have. If the nonlinearity is not symmetric, it will not induce an anti-hermitian operator in 2.1 and, therefore, cannot guarantee stability.

Also different is that we do not rely on $1\times 1\times 1$ convolutions for compression and expansion. Because we explicitly induce and recognize the block-low-rank structure in our network layer, we can directly observe various linear-algebraic properties of interest.

Linear algebraic matrix-vector product notation is required to interpret and construct the block-low-rank layer. The computational implementation can still use a 5D tensor format for the numerical examples. First, flatten the tensors that contain network states of type $\mathbb{R}^{n_{x}\times n_{y}\times n_{z}\times n_{\text{chan}}}$ to block-vectors $\mathbb{R}^{n_{x}n_{y}n_{z}n_{\text{chan}}}$ that contain $n_{\text{chan}}$ sub-vectors $Y^{i}\in\mathbb{R}^{n_{x}n_{y}n_{z}}$ ,

$$\mathbf{Y}\equiv\begin{bmatrix}Y^{1}\\ Y^{2}\\ \vdots\\ Y^{n_{\text{chan}}}\end{bmatrix}. \tag{3.15}$$

Similarly, we rewrite the convolutional kernels for a given layer in tensor format $\mathbb{R}^{n_{x}\times n_{y}\times n_{z}\times n_{\text{chan out}}\times n_{\text{chan in}}}$ as the block matrix $\mathbf{K}$ with $n_{\text{chan out}}$ rows and $n_{\text{chan in}}$ columns. Each block $K(\theta^{i,k})$ is a Toeplitz matrix representation of the convolution with a kernel $\theta^{i,k}$,

$$\mathbf{K}\equiv\begin{bmatrix}K(\theta^{1,1})&K(\theta^{1,2})&\ldots&K(\theta^{1,n_{\text{chan in}}})\\ K(\theta^{2,1})&K(\theta^{2,2})&\ldots&K(\theta^{2,n_{\text{chan in}}})\\ \vdots&\vdots&\ddots&\vdots\\ K(\theta^{n_{\text{chan out}},1})&K(\theta^{n_{\text{chan out}},2})&\ldots&K(\theta^{n_{\text{chan out}},n_{\text{chan in}}})\end{bmatrix}. \tag{3.16}$$

This block matrix is square if the number of channels remains unchanged after the convolutions. To reduce the number of channels, $\mathbf{K}$ needs to be a flat matrix; a tall block-matrix increases the number of channels. The collection of convolutional kernels at layer $j$ is denoted by $\boldsymbol{\theta}_{j}$ .

Using this notation, the non-linear part of each layer in the hyperbolic network (2.6) becomes (as an example, consider a case with three input channels in ${\bf Y}_{j}$ but using only six convolutional kernels instead of the usual nine)

$$\begin{bmatrix}K(\theta^{1,1})^{\top}&K(\theta^{2,1})^{\top}\\ K(\theta^{1,2})^{\top}&K(\theta^{2,2})^{\top}\\ K(\theta^{1,3})^{\top}&K(\theta^{2,3})^{\top}\end{bmatrix}\sigma\Bigg(\begin{bmatrix}K(\theta^{1,1})&K(\theta^{1,2})&K(\theta^{1,3})\\ K(\theta^{2,1})&K(\theta^{2,2})&K(\theta^{2,3})\end{bmatrix}\begin{bmatrix}Y^{1}\\ Y^{2}\\ Y^{3}\end{bmatrix}\Bigg). \tag{3.17}$$

For simplicity, we assumed there is no coarsening via the Haar transform at this particular layer. The symmetric layer structure enables a layer to have the same number of input and output channels while working with fewer convolutional kernels than $n_{\text{chan}}^{2}$ if $\mathbf{K}$ is a flat block-matrix, that is, one with fewer block rows than block columns, as in (3.17).

A benefit of our proposed construction is the following clear linear-algebraic interpretation:

The structure ${\bf K}_{j}^{\top}\sigma({\bf K}_{j}{\bf Y}_{j})$, using a block matrix $\mathbf{K}$ with $m\times n$ blocks in which every block is a convolution matrix, induces several properties, including:

• the number of convolutional kernels in $\mathbf{K}$ is given by $mn$, and the number of convolutions plus transposed convolutions is $2mn$ per layer.

• the block rank of $\mathbf{K}^{\top}\mathbf{K}$ is at most $m$.

• the number of unique kernels in $\mathbf{K}^{\top}\mathbf{K}$ is at most $(n^{2}+n)/2$.

Another key observation is that a non-square $\mathbf{K}$ does not affect the invertibility of the network because the rank-deficient matrix $\mathbf{K}^{\top}\mathbf{K}$ does not need to be inverted for the reverse propagation as defined in (2.9).

Including the symmetric block-low-rank layer extends the applicability of fully invertible hyperbolic networks to larger input data and an increased number of coarsening stages in the network. See Figure 2 for a summary of the memory savings.
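The block-low-rank layer can be illustrated with a small dense example. The sketch below makes several simplifying assumptions that are ours: each block $K(\theta^{i,k})$ is the circulant matrix of a 1D periodic convolution on a grid of $16$ points (instead of a 3D convolution), the state is a flattened vector as in (3.15), and the helper names are hypothetical. With $m=2$ block rows and three channels, the layer uses six kernels, as in (3.17).

```python
import numpy as np


def conv_matrix(kernel, n):
    """Circulant matrix of a 1D periodic convolution: one block K(theta)."""
    C = np.zeros((n, n))
    for shift, w in enumerate(kernel):
        C += w * np.roll(np.eye(n), shift - len(kernel) // 2, axis=1)
    return C


def block_matrix(kernels, n):
    """Assemble K from an (m, n_chan) grid of kernels, as in (3.16)."""
    return np.block([[conv_matrix(k, n) for k in row] for row in kernels])


def blr_layer(K, Y):
    """Symmetric layer K^T relu(K Y) acting on a flattened state Y."""
    return K.T @ np.maximum(K @ Y, 0.0)


rng = np.random.default_rng(0)
n, n_chan, m = 16, 3, 2                          # grid size, channels, block rows
kernels = rng.standard_normal((m, n_chan, 3))    # m * n_chan kernels of width 3
K = block_matrix(kernels, n)                     # shape (m * n, n_chan * n)
Y = rng.standard_normal(n_chan * n)              # flattened state with 3 channels

out = blr_layer(K, Y)                            # same size as Y
n_kernels = m * n_chan                           # 6 instead of n_chan**2 = 9
rank = np.linalg.matrix_rank(K.T @ K)            # at most m * n
```

In this 1D analogue, a block rank of at most $m$ translates into a matrix rank of $\mathbf{K}^{\top}\mathbf{K}$ of at most $mn_{\text{grid}}=32$, while the layer output keeps all $3\times 16=48$ entries of the input state.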

3.2 Invertible hyperbolic networks with a different number of input and output channels.

To segment data into $n_{\text{class}}$ classes, one typically employs a neural network that maps the data to the last network state (output) $\mathbf{Y}\in\mathbb{R}^{n_{x}\times n_{y}\times n_{z}\times n_{\text{chan}}}$ , where we would choose $n_{\text{class}}=n_{\text{chan}}$ . The standard accompanying loss function is the multi-class cross-entropy loss

L(\mathbf{Y},\mathbf{C})=-\sum_{(i,j,k)}\sum_{l=1}^{n_{\text{chan}}}\mathbf{C}_{i,j,k,l}\log(\mathbf{Y}_{i,j,k,l}),

(3.18)

where $\mathbf{C}\in\mathbb{R}^{n_{x}\times n_{y}\times n_{z}\times n_{\text{class}}}$ contains the labels, and $i$, $j$, $k$ are spatial indices.

In the case of invertible neural networks, the situation is more complicated. The standard form of the fully invertible hyperbolic network outputs the same tensor size as the input, i.e., the number of input and output channels is equal. This is limiting because the data rarely has the same number of input channels as output classes. Fortunately, invertible networks can still be used with a minor modification to the loss function.

Consider input data with $n_{\text{chan}}$ channels, so there are also $n_{\text{chan}}$ output channels, assuming the network contains as many forward as inverse transformations. To train a network for segmenting the data into $n_{\text{class}}$ classes, we need $n_{\text{chan}}\geq n_{\text{class}}$. To deal with the mismatch between the channel count and the number of classes, we can compute the partial cross-entropy loss

L(\mathbf{Y},\mathbf{C})=-\sum_{(i,j,k)}\sum_{l=1}^{n_{\text{class}}}\mathbf{C}_{i,j,k,l}\log(\mathbf{Y}_{i,j,k,l}).

(3.19)

This loss function simply computes the loss via summation over the $n_{\text{class}}$ ‘active’ channels only that correspond to labels. The other channels also have an output which is not used to compute a loss or gradient. These channels are still required to reverse the direction of the network, i.e., all channels are required to generate input data from a prediction (and the second to last layer).
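As an illustration of (3.19), the sketch below evaluates the partial cross-entropy with NumPy. It assumes that the active channels of $\mathbf{Y}$ already hold positive class probabilities (how they are normalized is not specified here), and the function name `partial_cross_entropy` and the tensor sizes are hypothetical.

```python
import numpy as np

def partial_cross_entropy(Y, C):
    """Sum -C * log(Y) over the first n_class channels of Y only.

    Y: (nx, ny, nz, n_chan) network output, assumed positive in the active channels.
    C: (nx, ny, nz, n_class) one-hot labels, with n_class <= n_chan.
    """
    n_class = C.shape[-1]
    active = Y[..., :n_class]          # channels that correspond to labels
    return -np.sum(C * np.log(active))

rng = np.random.default_rng(0)
nx, ny, nz, n_chan, n_class = 4, 5, 3, 8, 2
Y = rng.uniform(0.1, 1.0, size=(nx, ny, nz, n_chan))
classes = rng.integers(0, n_class, size=(nx, ny, nz))
C = np.eye(n_class)[classes]           # one-hot encoding of the class indices

loss = partial_cross_entropy(Y, C)

# Changing the inactive channels leaves the loss untouched.
Y_perturbed = Y.copy()
Y_perturbed[..., n_class:] = rng.uniform(0.1, 1.0, size=(nx, ny, nz, n_chan - n_class))
loss_perturbed = partial_cross_entropy(Y_perturbed, C)
```

Because the inactive channels never enter the sum, their gradient contribution is zero, while they remain available for the reverse propagation.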

The above assumed that $n_{\text{chan}}\geq n_{\text{class}}$ , which is generally not the case. The number of input channels can be increased by

• duplicating (some of) the channels of the input data. This is a valid strategy, but it increases the total memory load because each layer contains more channels: the input data size grows linearly with the number of duplicated input data channels.
• transforming the data into more channels while preserving the information. We can achieve this by taking a Haar transform (or another orthogonal transform) that reduces the data resolution and increases the number of channels while keeping the memory load constant. Applying the transform to data $\mathbf{Y}$ changes the sizes according to $\mathbf{W}\mathbf{Y}:\mathbb{R}^{n_{x}\times n_{y}\times n_{z}\times n_{\text{chan}}}\rightarrow\mathbb{R}^{n_{x}/2\times n_{y}/2\times n_{z}/2\times 8n_{\text{chan}}}$. Multiple transforms applied in sequence further increase the number of channels if required. To emphasize, this transform is applied to the data before it enters the network. It is, therefore, a one-time operation and does not add significant computational training time. This option is more desirable in terms of memory.
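The second option can be sketched with a single-level orthogonal Haar transform written in NumPy. This is an illustrative implementation under the assumption that all spatial dimensions are even; the function names `haar_axis` and `haar3d` and the channel ordering are choices made for this example, not taken from the reference software.

```python
import numpy as np

def haar_axis(y, axis):
    """One-level orthogonal Haar step along one spatial axis.

    Halves the size of `axis` and doubles the channel (last) axis.
    """
    even = np.take(y, np.arange(0, y.shape[axis], 2), axis=axis)
    odd = np.take(y, np.arange(1, y.shape[axis], 2), axis=axis)
    low = (even + odd) / np.sqrt(2.0)    # local averages (scaled)
    high = (even - odd) / np.sqrt(2.0)   # local differences (scaled)
    return np.concatenate([low, high], axis=-1)

def haar3d(y):
    """Map (nx, ny, nz, c) to (nx/2, ny/2, nz/2, 8c); all spatial sizes must be even."""
    for axis in (0, 1, 2):
        y = haar_axis(y, axis)
    return y

rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 6, 8, 3))
WY = haar3d(Y)

print(Y.shape, "->", WY.shape)
print(np.isclose(np.linalg.norm(Y), np.linalg.norm(WY)))  # orthogonal: norm is preserved
```

The number of stored values is unchanged, so the memory load stays constant while the channel count grows by a factor of eight per transform.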

Suppose one is willing to give up invertibility. In that case, we can simply add a non-square linear operator to map the output of the invertible network to the desired number of output channels for the loss, albeit at a higher memory cost and loss of stability guarantees as in Thm. 2.1.

3.3 Tasks with different input and output dimensions.

Applications like hyperspectral land-use segmentation intrinsically reduce the dimensionality between input data and output by collapsing the frequency axis into a point, i.e., $\mathbb{R}^{n_{x}\times n_{y}\times n_{\text{freq}}}\rightarrow\mathbb{R}^{n_{x}\times n_{y}}$. In the case of time-lapse hyperspectral segmentation, there is also a reduction along the time axis as $\mathbb{R}^{n_{x}\times n_{y}\times n_{\text{freq}}\times n_{t}}\rightarrow\mathbb{R}^{n_{x}\times n_{y}}$.

The above tasks seem incompatible with the fully invertible hyperbolic neural network that outputs a tensor the same size as the input. This time, we propose to measure the loss over a single slice of the output tensor. That is, we embed the known ground-truth labels, which depend on the class and the spatial coordinates $x$ and $y$, in a larger label tensor $\mathbf{C}\in\mathbb{R}^{n_{x}\times n_{y}\times n_{\text{freq}}\times n_{\text{chan}}}$ at slice $p$ as $\mathbf{C}_{:,:,p,1:n_{\text{class}}}$. All other entries of the label tensor are undefined and do not contribute to the loss or its gradient. See Peters et al. (2019b, c) for more information and applications of partial loss functions. The resulting multi-class cross-entropy function reads

L(\mathbf{Y},\mathbf{C})=-\sum_{(i,j)}\sum_{l=1}^{n_{\text{class}}}\mathbf{C}_{i,j,p,l}\log(\mathbf{Y}_{i,j,p,l}),

(3.20)

where $p$ is the fixed tensor-slice index that contains the labels. When only sparse spatial location indices of known labels are available, the sum over $(i,j)$ reduces to a sum over the subset of labeled pixels.
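A possible NumPy realization of (3.20), including the restriction to sparsely labeled pixels, is sketched below. The function name `slice_cross_entropy`, the boolean `mask` argument, and the tensor sizes are hypothetical, and the output is again assumed to hold positive probabilities in the active channels.

```python
import numpy as np

def slice_cross_entropy(Y, labels, p, mask=None):
    """Cross-entropy measured on slice p of the third axis only.

    Y:      (nx, ny, nfreq, n_chan) network output, assumed positive where used.
    labels: (nx, ny, n_class) one-hot labels for the 2D map.
    mask:   optional (nx, ny) boolean array marking pixels with known labels.
    """
    n_class = labels.shape[-1]
    log_probs = np.log(Y[:, :, p, :n_class])
    per_pixel = -np.sum(labels * log_probs, axis=-1)
    if mask is not None:
        per_pixel = per_pixel[mask]      # keep labeled pixels only
    return per_pixel.sum()

rng = np.random.default_rng(0)
nx, ny, nfreq, n_chan, n_class, p = 6, 5, 4, 3, 2, 1
Y = rng.uniform(0.1, 1.0, size=(nx, ny, nfreq, n_chan))
labels = np.eye(n_class)[rng.integers(0, n_class, size=(nx, ny))]

mask = np.zeros((nx, ny), dtype=bool)
mask[::2, ::2] = True                    # pretend only a few pixels are annotated

full_loss = slice_cross_entropy(Y, labels, p)
sparse_loss = slice_cross_entropy(Y, labels, p, mask)
```

Entries of the output outside slice `p`, or outside the first `n_class` channels, have no influence on either value.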

3.4 Resolution changes between input and output.

Some applications desire a different resolution for output compared to the input. Once more, straightforward application of the fully invertible hyperbolic network cannot accomplish resolution changes between input and output because the input and output tensors have the same size. It turns out that there is still a way to train an invertible network such that input and output are on different resolutions. To start, consider a single network layer of an invertible hyperbolic network that decreases the resolution via

\mathbf{Y}_{3}=2\mathbf{W}\mathbf{Y}_{2}-\mathbf{W}\mathbf{Y}_{1}-h^{2}\mathbf{K}_{3}^{\top}\sigma(\mathbf{K}_{3}\mathbf{W}\mathbf{Y}_{2}),

where $\mathbf{W}$ is an orthogonal transform that transforms and reorganizes the 4D input as $\mathbf{W}\mathbf{Y}:\mathbb{R}^{n_{1}\times n_{2}\times n_{3}\times n_{\text{chan}}}\rightarrow\mathbb{R}^{n_{1}/2\times n_{2}/2\times n_{3}/2\times 8n_{\text{chan}}}$. Note that the first output channel of the single-level Haar transform is the input at a resolution reduced by a factor of two. The same holds for other transforms like the pixel-shuffle transform. So as long as the network contains more forward transforms than inverse transforms, the output resolution is lowered by a factor of two per surplus forward transform. Network training can utilize this concept by defining the loss function over a particular selection of output channels. Similar logic applies to resolution increases.
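To make the coarsening layer concrete, the following sketch applies it once and then reverses it. It is a simplified 2D illustration under several assumptions that are not part of the text: $\mathbf{K}_{3}$ is modeled as a channel-mixing matrix (a $1\times 1$ convolution), $\sigma$ is a ReLU, and the helper names (`haar_axis`, `W`, `W_inv`, `forward`, `reverse`) are hypothetical.

```python
import numpy as np

S = np.sqrt(2.0)

def _even_odd(ndim, axis):
    """Index tuples selecting even and odd positions along `axis`."""
    e, o = [slice(None)] * ndim, [slice(None)] * ndim
    e[axis], o[axis] = slice(0, None, 2), slice(1, None, 2)
    return tuple(e), tuple(o)

def haar_axis(y, axis):
    e, o = _even_odd(y.ndim, axis)
    return np.concatenate([(y[e] + y[o]) / S, (y[e] - y[o]) / S], axis=-1)

def haar_axis_inv(z, axis):
    c = z.shape[-1] // 2
    low, high = z[..., :c], z[..., c:]
    shape = list(low.shape)
    shape[axis] *= 2
    y = np.empty(shape)
    e, o = _even_odd(y.ndim, axis)
    y[e], y[o] = (low + high) / S, (low - high) / S
    return y

def W(y):        # (nx, ny, c) -> (nx/2, ny/2, 4c)
    return haar_axis(haar_axis(y, 0), 1)

def W_inv(z):    # exact inverse (and transpose) of W
    return haar_axis_inv(haar_axis_inv(z, 1), 0)

def nonlinear(K, z, h):
    """h^2 K^T relu(K z), with K acting on the channel axis."""
    return h**2 * (np.maximum(z @ K.T, 0.0) @ K)

def forward(Y1, Y2, K, h):
    WY2 = W(Y2)
    return 2.0 * WY2 - W(Y1) - nonlinear(K, WY2, h), WY2

def reverse(Y3, WY2, K, h):
    WY1 = 2.0 * WY2 - Y3 - nonlinear(K, WY2, h)   # same formula, solved for W Y1
    return W_inv(WY1), W_inv(WY2)

rng = np.random.default_rng(0)
Y1, Y2 = rng.standard_normal((8, 6, 2)), rng.standard_normal((8, 6, 2))
K = rng.standard_normal((3, 8))                   # 3 block rows, 4 * 2 = 8 channels
Y3, WY2 = forward(Y1, Y2, K, h=0.1)
Y1_rec, Y2_rec = reverse(Y3, WY2, K, h=0.1)
print(Y3.shape, np.allclose(Y1_rec, Y1), np.allclose(Y2_rec, Y2))
```

The reverse step only needs the two most recent states and the inverse of the orthogonal transform; the non-square matrix `K` is never inverted.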

4 Examples

The foundation of the following experiments is the fully invertible hyperbolic network, implemented by Witte et al. (2020); Orozco et al. (2023). Its specialization to geoscientific and remote sensing problems is available at https://github.com/PetersBas/FHN_Examples. The experiments illustrate that the proposed techniques enable the application of fully invertible hyperbolic networks to various large-scale geoscience problems while obtaining satisfactory results. Table 1 summarizes the memory requirements for the states and convolutional kernels, not including memory allocations for intermediate computations, Lagrangian multipliers for gradient computations and the gradient itself. The table shows that both invertibility and block-low-rank layers are required to run examples on a standard GPU, like a $24$ GB NVIDIA GeForce RTX 3090. Figure 2 shows the general memory scaling of our approach. In order to train a non-invertible network with the same number of channels per layer and the same input data size, the number of layers needs to be reduced to keep the memory footprint manageable.

4.1 Time-lapse hyperspectral land-use change detection

The data $\mathbf{X}\in\mathbb{R}^{n_{x}\times n_{y}\times n_{\text{freq}}\times n_{t}}$ has two spatial dimensions of sizes $n_{x}$ and $n_{y}$; the third dimension corresponds to frequency, and there is one channel per time of data collection (two in this example). Figure 4 displays this data set (Hasanlou and Seydi, 2018). We follow common practice in hyperspectral imaging literature, where part of the segmentation is assumed known. The red and white dots in Figure 5 show the $70$ known label locations; training utilizes $50$ annotations, while the validation is based on the remaining $20$.

This example aims to predict the land-use change on a coarser grid. Table 2 contains the network details that correspond to a network that contains one more Haar transform than inverse Haar transforms, so the output feature resolution decreases by a factor of two. We use stochastic gradient descent with momentum and a decaying learning rate for $70$ iterations to minimize the loss function (3.20). The loss is measured over a few slices of the output tensor that embeds the labels. Figure 5 shows true land-use change, prediction, and errors. Aside from some boundary artifacts, a few farm fields were classified incorrectly. The low-memory nature of the invertible network enabled us to input the entire 4D data in one chunk; see also Table 1 for details regarding memory usage. We also show the validation loss in Figure 6, as well as the validation losses for three comparison multi-level ResNets (see the Appendix for the network details). Because of the memory constraints, the ResNets need to be significantly shorter than the invertible network with BLR layers. The losses for the various ResNets are comparable but do not approach the loss of the proposed network.

4.2 Regional-scale aquifer mapping

The task is to delineate large aquifers in Arizona, USA; see Figure 9. The two classes are 1) basin and range / Colorado Plateau aquifer; 2) no aquifer (Robson and Banta, 1995). The survey area is most of the state. Aircraft and satellite-based sensors collected magnetic data, two types of gravity measurements, and the topography; see Figure 7. Besides these remotely acquired data, we add two types of geological maps: one in terms of rock age and one in terms of rock type. The advantage of using geological maps is that they incorporate expert knowledge into our data. Geologists construct these maps by synthesizing their geological knowledge with ground truth observations, hyperspectral data, and various airborne and land-based geophysical surveys. Because the geological maps in Figure 7 are not invariant under the permutation of the class numbers, we use their one-hot encodings (52 separate maps).

Table 2: Network designs for the fully invertible networks.

		Hyperspectral
Layer	Channels	Block rank	Feature size
1-6	16	16	$368\times 288\times 184$
7-18	128	16	$184\times 144\times 92$
		Aquifer mapping
Layer	Channels	Block rank	Feature size
1-4	112	24	$848\times 1456$
5-7	448	24	$424\times 728$
8-10	1792	24	$212\times 364$
11-28	7168	24	$106\times 182$
29-32	1792	24	$212\times 364$
32-34	448	24	$424\times 728$
35-39	112	24	$848\times 1456$

The experimental setting assumes an expert annotated the aquifers in a few patches (see Figure 8), so the neural network assists a domain expert by interpolating and extrapolating limited annotations. Training is similar to the previous example: SGD with momentum and a decaying learning rate for $140$ iterations to minimize the multi-class cross-entropy loss. Each iteration uses about $10\%$ of the known labels (randomly selected) to compute an approximation of the loss and the gradient. We also augment the data with random flips and permutations. The network details can be found in Table 2. The $39$ layer fully invertible hyperbolic network uses three coarsening stages (see Table 2) to increase the receptive field size and enable information to propagate over larger spatial distances. Figure 9 displays the results and errors. Most of the errors are concentrated along some of the geological rock-type boundaries.
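The per-iteration label subsampling and flip augmentation described above might look as follows in NumPy. This is only a sketch: the function names, the flip probability of one half, and the array sizes are assumptions, and the random permutations mentioned in the text are not included.

```python
import numpy as np

def sample_labels(label_mask, fraction, rng):
    """Return a boolean mask that keeps a random `fraction` of the known labels."""
    known = np.flatnonzero(label_mask)
    n_pick = max(1, int(round(fraction * known.size)))
    chosen = rng.choice(known, size=n_pick, replace=False)
    batch_mask = np.zeros(label_mask.size, dtype=bool)
    batch_mask[chosen] = True
    return batch_mask.reshape(label_mask.shape)

def random_flip(data, labels, batch_mask, rng):
    """Flip data, labels, and mask together along each spatial axis with probability 1/2."""
    for axis in (0, 1):
        if rng.random() < 0.5:
            data = np.flip(data, axis=axis)
            labels = np.flip(labels, axis=axis)
            batch_mask = np.flip(batch_mask, axis=axis)
    return data, labels, batch_mask

rng = np.random.default_rng(0)
data = rng.standard_normal((20, 30, 4))          # (nx, ny, channels), toy sizes
labels = rng.integers(0, 2, size=(20, 30))       # two classes
label_mask = np.zeros((20, 30), dtype=bool)
label_mask[2:12, 5:25] = True                    # annotated patch with 200 pixels

batch_mask = sample_labels(label_mask, 0.10, rng)
data_aug, labels_aug, mask_aug = random_flip(data, labels, batch_mask, rng)
```

The returned mask can then restrict the loss to the selected pixels, so each iteration sees a different random subset of the annotations.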

This example showed that neural networks could assist domain experts and mimic their work for aquifer mapping. Invertible networks can deal with such large computational domains with many input channels in one chunk. The invertibility reduced the memory required for storing the network states from $21.02$ GB to just $1.66$ GB; see Table 1. For this example with many input channels and multiple coarsening stages, training the parameters of the network directly in a compressed/factorized form used just $0.12$ GB for storing the convolutional kernels, instead of the $32.19$ GB needed to store the unreasonably many convolutional kernels that a standard invertible hyperbolic network would require for this particular design; see Table 1.
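A rough per-layer estimate shows where kernel savings of this kind come from. The sketch below uses the channel count and block rank of the coarsest level in Table 2, and it assumes $3\times 3$ kernels stored as 32-bit floats; both are assumptions for illustration, so the numbers are not meant to reproduce the totals in Table 1. The function name `kernel_bytes` is hypothetical.

```python
def kernel_bytes(n_chan, block_rank=None, kernel_elems=9, bytes_per_value=4):
    """Storage for the kernels of one layer.

    A standard layer stores n_chan * n_chan kernels; a block-low-rank layer
    stores block_rank * n_chan kernels. kernel_elems=9 (3x3 kernels) and
    bytes_per_value=4 (float32) are assumptions.
    """
    rows = n_chan if block_rank is None else block_rank
    return rows * n_chan * kernel_elems * bytes_per_value

full = kernel_bytes(7168)                   # standard layer, 7168 channels
blr = kernel_bytes(7168, block_rank=24)     # block-low-rank layer, block rank 24

print(f"standard layer: {full / 2**30:.2f} GiB")
print(f"BLR layer:      {blr / 2**20:.2f} MiB")
print(f"reduction:      {full / blr:.1f}x")
```

Under these assumptions the reduction factor for a single layer equals the ratio of the channel count to the block rank.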

4.3 3D interpolation-segmentation of a seismic image volume from borehole information.

Building a 3D geological model from seismic imaging means grouping several layers, structures, or geological units to obtain a simplified geological model that conveys the information of interest. The interpretation of seismic volumes is challenging due to imaging artifacts (data noise, violation of assumptions in the imaging algorithm, poor illumination from seismic waves), spatial variation in the appearance of the interfaces, discontinuities of the interface due to geological faults, and a lack of ground-truth away from boreholes.

Here, we present an experiment aiming to segment the full 3D seismic volume (Figure 10) from borehole information. We assume that a small area around each borehole was interpreted by an expert and can serve as training and validation labels (Figure 10). Interpreting the seismic image close to boreholes is relatively easy due to the proximity to the ground truth.

While seismic interpretation from borehole information is nothing new, most work maps 2D image-to-image. Some 3D seismic interpretation approaches operate 3D-to-3D but are limited to training with relatively small 3D sub-cubes due to memory limitations and use sub-cubes of up to $128^{3}$ (Wu et al., 2019; Zhao, 2019; Shi et al., 2018; Wang and Nealon, 2019; Gao et al., 2021; Dou et al., 2022). Using larger 3D sub-cubes enables learning larger-scale structures present in the data. Peters and Haber (2020) show the first 3D interpretation approach using an invertible network and use an input size of up to $192\times 192\times 288\times 3$ . Here, we show results from training on the largest inputs to date that are about $5.7\times$ larger than previous work.

The full data has a size of $401\times 701\times 248$. For training, we replicate the data into $12$ channels and input to the network a randomly selected sub-cube of size $248\times 248\times 248\times 12$. Table 3 lists the network details. We reduce the cross-entropy loss using the ADAM optimizer for $240$ iterations with a decaying stepsize. Each iteration selects a randomly located sub-cube. Table 4 displays the intersection over union (IoU) of the final results. Figure 11 shows 2D cross-sections from the 3D volume, accompanied by the true labels. The final prediction results from inference on the full data volume split into a couple of overlapping pieces. This example shows that we can obtain good segmentation results from partial labeling while training on large 3D input sub-cubes. The size of the sub-cubes is important because selecting only small sub-cubes would lead to most cubes containing no labels. Larger sub-cubes can connect more of the data to labeled locations.
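The sub-cube selection and channel replication can be sketched as follows. The function name `random_subcube` is hypothetical, and the example runs on a small stand-in volume rather than the full $401\times 701\times 248$ data to keep it lightweight.

```python
import numpy as np

def random_subcube(volume, size, n_chan, rng):
    """Cut a randomly located size^3 cube and replicate it into n_chan channels.

    volume: 3D array with every dimension >= size.
    Returns the (size, size, size, n_chan) cube and its corner indices.
    """
    i, j, k = (int(rng.integers(0, dim - size + 1)) for dim in volume.shape)
    cube = volume[i:i + size, j:j + size, k:k + size]
    replicated = np.repeat(cube[..., np.newaxis], n_chan, axis=-1)
    return replicated, (i, j, k)

rng = np.random.default_rng(0)
volume = rng.standard_normal((10, 12, 8))    # stand-in for the seismic volume
cube, corner = random_subcube(volume, size=8, n_chan=3, rng=rng)
print(cube.shape, corner)
```

When a dimension of the volume equals the cube size, as for the depth axis here, the only admissible corner index along that axis is zero.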

Table 3: Network design for the fully invertible network for the 3D seismic interpretation example.

Layer	Block rank	Channels	Feature size
1-2	8	12	$248\times 248\times 248$
3-5	16	96	$124\times 124\times 124$
6-8	32	768	$62\times 62\times 62$
9-18	32	6144	$31\times 31\times 31$
19-21	32	768	$62\times 62\times 62$
22-24	16	96	$124\times 124\times 124$
25-30	8	12	$248\times 248\times 248$

Table 1 displays the memory usage of our network and equivalent standard invertible and non-invertible networks. The table also shows that a direct comparison with the equivalent non-invertible version of the hyperbolic network without block-low-rank layers is not possible on a standard $\approx 24$ GB GPU. Instead, we evaluated two closely related networks that just fit on the GPU. These networks are similar to ours (Table 3), except that they have one fewer level and are a few layers shorter. See the Appendix for network details and Figure 13 for prediction images. Table 4 contains the statistics in terms of intersection over union (IoU), which show that using one fewer level or reducing the number of layers to make the network fit on a GPU comes at the cost of a significant drop in IoU score.

Table 4: Validation IoU values for the 3D seismic example. The design of the invertible + BLR layer network from Table 3 follows Equation (2). The memory requirements for this network and its non-invertible equivalent without block-low-rank layers are shown in Table 1. See the Appendix for the design details of networks $a$ and $b$.

Network type	Validation IoU
Invertible + BLR layers (ours)	class 1/class 2: $0.97$ / $0.96$
Non-invertible equivalent, no BLR layers	out of memory
Largest non-invertible related^a	class 1/class 2: $0.91$ / $0.85$
Largest non-invertible related^b	class 1/class 2: $0.92$ / $0.88$
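For reference, a per-class intersection over union of the kind reported in Table 4 can be computed as in the sketch below. The function name `class_iou` and the optional validation `mask` are hypothetical, and the tiny arrays are made-up inputs for illustration only.

```python
import numpy as np

def class_iou(pred, truth, cls, mask=None):
    """Intersection over union for one class, optionally on labeled pixels only."""
    p, t = pred == cls, truth == cls
    if mask is not None:
        p, t = p[mask], t[mask]          # restrict to validation locations
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")              # class absent in both prediction and truth
    return np.logical_and(p, t).sum() / union

pred = np.array([[1, 1], [2, 2]])
truth = np.array([[1, 2], [2, 2]])

iou_1 = class_iou(pred, truth, cls=1)    # 1 shared pixel, 2 in the union
iou_2 = class_iou(pred, truth, cls=2)    # 2 shared pixels, 3 in the union
print(iou_1, iou_2)
```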

This example also showed that multi-level, fully invertible hyperbolic networks are suitable for 3D seismic segmentation and that the more than six thousand channels are no issue if we use symmetric layers with a low block-rank.

4.4 Effect of the selection of the maximum block-rank

The previous examples utilize a block rank of $\mathbf{K}^{\top}\mathbf{K}$ that is much below full rank. Naturally, questions arise about selecting the block rank and determining the sensitivity of the network performance to the block rank.

The maximum rank we can practically select is limited by the available memory to store convolutional kernels, or by the computational time that is available. Figure 12 provides a more quantitative and experimental answer. The figure shows all experiments repeated for various values of the maximum block-rank and averaged over five random initializations of the network parameters. The conclusion is that a very low block rank generally comes at the cost of slightly reduced prediction quality, as measured using intersection over union. A high block rank can also slightly reduce the prediction quality because it generates less implicit regularization (no other forms of regularization were used in the numerical experiments). We finalize the experimental evaluation by noting that the IoU results show clear trends, but the assessment of the geoscientific and remote sensing examples remains challenging because the seismic, aquifer, and hyperspectral labels are expert interpretations of the data, and not ground truth. The aquifer mapping example is based on various data sources, including ones not fed into the network. All of these come with some ambiguity.

5 Conclusions

The high memory requirements for training a deep neural network using automatic differentiation are a critical issue that limits the application to large input-data blocks like hyperspectral data, very large-scale multi-modality 2D geoscientific maps and airborne-remote sensing, as well as 3D seismic imagery. Fully invertible neural networks mostly solve the memory requirement issues for network states and achieve a constant memory footprint independent of the number of layers and pooling/coarsening stages inside the network.

This work takes a closer look at the fully invertible hyperbolic network based on a conservative leapfrog discretization of the non-linear hyperbolic telegraph equation. Problems are uncovered that prevent the direct application of fully invertible hyperbolic networks to tasks that require multiple coarsening/pooling stages in the network, applications that reduce a 3D/4D tensor into a 2D map of the earth, tasks with different numbers of input and output channels, and applications with resolution changes. For each of these issues, we provided a solution that enables the application of the network without fundamentally altering its design.

We introduce a layer design where the matrix representation of the convolutional kernels has a low block-rank structure. This design changes the exponential growth of the memory for convolutional kernels as a function of the number of pooling stages into a tuneable memory footprint. Changing the number of channels, dimensions, and resolution between input and output is enabled by embedding the labels into larger tensors in combination with particular ways to measure the loss, and using a different number of forward and inverse orthogonal transforms inside the network.

Examples illustrate how to apply fully invertible networks to time-lapse hyperspectral data with a resolution change, very large-scale multi-modal remote sensing for sub-surface aquifer mapping, and 3D seismic interpretation. The tools developed in this work enable invertible networks to be applied to these problems and thus learn from larger input blocks of data, which in turn enables the network to learn from larger-scale structures present in the data.

References

Kervadec et al. [2019] Hoel Kervadec, Jose Dolz, Meng Tang, Eric Granger, Yuri Boykov, and Ismail Ben Ayed. Constrained-cnn losses for weakly supervised segmentation. Medical Image Analysis, 54:88–99, 2019. ISSN 1361-8415. doi:https://doi.org/10.1016/j.media.2019.02.009. URL https://www.sciencedirect.com/science/article/pii/S1361841518306145.
Peters [2022] Bas Peters. Point-to-set distance functions for output-constrained neural networks. J. Appl. Numer. Optim, 4(2):175–201, 2022.
Jia et al. [2021] Fan Jia, Jun Liu, and Xue-Cheng Tai. A regularized convolutional neural network for semantic image segmentation. Analysis and Applications, 19(01):147–165, 2021. doi:10.1142/S0219530519410148. URL https://doi.org/10.1142/S0219530519410148.
Lensink et al. [2022] Keegan Lensink, Bas Peters, and Eldad Haber. Fully hyperbolic convolutional neural networks. Research in the Mathematical Sciences, 9(4):1–22, 2022.
Ruthotto and Haber [2018] Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, pages 1–13, 2018.
Behrmann et al. [2019] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Joern-Henrik Jacobsen. Invertible residual networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 573–582. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/behrmann19a.html.
Etmann et al. [2020] Christian Etmann, Rihuan Ke, and Carola-Bibiane Schönlieb. iunets: Learnable invertible up- and downsampling for large-scale inverse problems. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2020. doi:10.1109/MLSP49062.2020.9231874.
Jacobsen et al. [2018] Jörn-Henrik Jacobsen, Arnold W.M. Smeulders, and Edouard Oyallon. i-revnet: Deep invertible networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJsjkMb0Z.
van de Leemput et al. [2019] Sil C van de Leemput, Jonas Teuwen, Bram van Ginneken, and Rashindra Manniesing. Memcnn: A python/pytorch package for creating memory-efficient invertible neural networks. Journal of Open Source Software, 4(39):1576, 2019.
Chang et al. [2018] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 32, 2018. doi:10.1609/aaai.v32i1.11668. URL https://ojs.aaai.org/index.php/AAAI/article/view/11668.
Gomez et al. [2017] Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Adv Neural Inf Process Syst, pages 2211–2221, 2017.
Dinh et al. [2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. CoRR, abs/1605.08803, 2016. URL http://arxiv.org/abs/1605.08803.
Peters et al. [2019a] Bas Peters, Eldad Haber, and Keegan Lensink. Symmetric block-low-rank layers for fully reversible multilevel neural networks. arXiv preprint arXiv:1912.12137, 2019a.
Zhou and Luo [2018] Yanjie Zhou and Zhendong Luo. A crank–nicolson collocation spectral method for the two-dimensional telegraph equations. Journal of Inequalities and Applications, 2018:1–17, 2018.
Haber and Ruthotto [2017] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, dec 2017. doi:10.1088/1361-6420/aa9a90.
LeVeque [1990] R.J. LeVeque. Numerical Methods for Conservation Laws. Birkhauser, 1990.
Truchetet and Laligant [2004] Frederic Truchetet and Olivier Laligant. Wavelets in industrial applications: a review. Wavelet Applications in Industrial Processing II, 5607, 2004. doi:10.1117/12.580395. URL https://doi.org/10.1117/12.580395.
Evans [2010] L.C. Evans. Partial Differential Equations. Graduate studies in mathematics. American Mathematical Society, 2010. ISBN 9780821849743. URL https://books.google.ca/books?id=Xnu0o_EJrCQC.
Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.
Denton et al. [2014] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1269–1277. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5544-exploiting-linear-structure-within-convolutional-networks-for-efficient-evaluation.pdf.
Ephrath et al. [2019] Jonathan Ephrath, Lars Ruthotto, Eldad Haber, and Eran Treister. Leanresnet: A low-cost yet effective convolutional residual networks. arXiv preprint arXiv:1904.06952, 2019.
Ding et al. [2017] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, X. Ma, Y. Zhang, J. Tang, Q. Qiu, X. Lin, and B. Yuan. Circnn: Accelerating and compressing deep neural networks using block-circulant weight matrices. In 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 395–408, Oct 2017.
Treister et al. [2018] Eran Treister, Lars Ruthotto, Michal Sharoni, Sapir Zafrani, and Eldad Haber. Low-cost parameterizations of deep convolutional neural networks. arXiv preprint arXiv:1805.07821, 2018.
Wu [2016] Jia-Nan Wu. Compression of fully-connected layer in neural network by kronecker product. In 2016 Eighth International Conference on Advanced Computational Intelligence (ICACI), pages 173–179, 2016. doi:10.1109/ICACI.2016.7449822.
Rennie and Srebro [2005] Jasson DM Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713–719. ACM, 2005.
Aravkin et al. [2014] Aleksandr. Aravkin, Rajiv. Kumar, Hassan. Mansour, Ben. Recht, and Felix J. Herrmann. Fast methods for denoising matrix completion formulations, with applications to robust seismic data interpolation. SIAM Journal on Scientific Computing, 36(5):S237–S266, 2014. doi:10.1137/130919210. URL https://doi.org/10.1137/130919210.
Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
Iandola et al. [2016] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
Peters et al. [2019b] Bas Peters, Justin Granek, and Eldad Haber. Multiresolution neural networks for tracking seismic horizons from few training images. Interpretation, 7(3):SE201–SE213, 2019b. doi:10.1190/INT-2018-0225.1. URL https://doi.org/10.1190/INT-2018-0225.1.
Peters et al. [2019c] Bas Peters, Eldad Haber, and Justin Granek. Neural networks for geophysicists and their application to seismic data interpretation. The Leading Edge, 38(7):534–540, 2019c. doi:10.1190/tle38070534.1. URL https://doi.org/10.1190/tle38070534.1.
Witte et al. [2020] Philipp Witte, grizzuti, Mathias Louboutin, Ali Siahkoohi, and Felix Herrmann. slimgroup/InvertibleNetworks.jl: Jacobians and adjoint Jacobians of layers and networks, November 2020. URL https://doi.org/10.5281/zenodo.4298853.
Orozco et al. [2023] Rafael Orozco, Philipp Witte, Mathias Louboutin, Ali Siahkoohi, Gabrio Rizzuti, Bas Peters, and Felix J Herrmann. Invertiblenetworks. jl: A julia package for scalable normalizing flows. arXiv preprint arXiv:2312.13480, 2023.
Hasanlou and Seydi [2018] Mahdi Hasanlou and Seyd Teymoor Seydi. Hyperspectral change detection: an experimental comparative study. International Journal of Remote Sensing, 39(20):7029–7083, 2018. doi:10.1080/01431161.2018.1466079. URL https://doi.org/10.1080/01431161.2018.1466079.
Robson and Banta [1995] Stanley G. Robson and Edward R. Banta. Ground water atlas of the united states: Segment 2, arizona, colorado, new mexico, utah. Technical report, U.S. Geological Survey, 1995. URL http://pubs.er.usgs.gov/publication/ha730C.
Wu et al. [2019] Xinming Wu, Luming Liang, Yunzhi Shi, and Sergey Fomel. Faultseg3d: Using synthetic data sets to train an end-to-end convolutional neural network for 3d seismic fault segmentation. GEOPHYSICS, 84(3):IM35–IM45, 2019. doi:10.1190/geo2018-0646.1. URL https://doi.org/10.1190/geo2018-0646.1.
Zhao [2019] Tao Zhao. 3d convolutional neural networks for efficient fault detection and orientation estimation. In SEG Technical Program Expanded Abstracts 2019, pages 2418–2422, 2019. doi:10.1190/segam2019-3216307.1. URL https://library.seg.org/doi/abs/10.1190/segam2019-3216307.1.
Shi et al. [2018] Yunzhi Shi, Xinming Wu, and Sergey Fomel. Automatic salt-body classification using deep-convolutional neural network. In SEG Technical Program Expanded Abstracts 2018, pages 1971–1975, 2018. doi:10.1190/segam2018-2997304.1. URL https://library.seg.org/doi/abs/10.1190/segam2018-2997304.1.
Wang and Nealon [2019] Enning Wang and Jeff Nealon. Applying machine learning to 3d seismic image denoising and enhancement. Interpretation, 7(3):SE131–SE139, 2019. doi:10.1190/INT-2018-0224.1. URL https://doi.org/10.1190/INT-2018-0224.1.
Gao et al. [2021] Kai Gao, Lianjie Huang, Yingcai Zheng, Rongrong Lin, Hao Hu, and Trenton Cladohous. Automatic fault detection on seismic images using a multiscale attention convolutional neural network. Geophysics, 87(1):N13–N29, 11 2021. ISSN 0016-8033. doi:10.1190/geo2020-0945.1. URL https://doi.org/10.1190/geo2020-0945.1.
Dou et al. [2022] Yimin Dou, Kewen Li, Jianbing Zhu, Timing Li, Shaoquan Tan, and Zongchao Huang. Md loss: Efficient training of 3-d seismic fault segmentation network under sparse labels by weakening anomaly annotation. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2022. doi:10.1109/TGRS.2022.3196810.
Peters and Haber [2020] Bas Peters and Eldad Haber. Fully reversible neural networks for large-scale 3d seismic horizon tracking. In EAGE 2020 Annual Conference & Exhibition Online, volume 2020, pages 1–5. European Association of Geoscientists & Engineers, 2020.

Appendix A Additional experimental details

Here, we provide more details on the design of the comparison networks. Table 1 shows the memory requirements for the fully invertible hyperbolic network with block-low-rank (BLR) layers. It also shows the memory required when we do not use the invertibility of the network to compute gradients, and when we do not use BLR layers. Details for the comparison networks in the numerical experiments section are provided below.

A.1 Hyperspectral time-lapse

We compare the results of our proposed network to those of the symmetric ResNet [Haber and Ruthotto, 2017]:

\mathbf{Y}_{j}=\mathbf{Y}_{j-1}-h\mathbf{K}(\boldsymbol{\theta}_{j})^{\top}\sigma(\mathbf{K}(\boldsymbol{\theta}_{j})\mathbf{Y}_{j-1}).

(1.21)
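As a minimal sketch of one layer of Eq. 1.21, the NumPy snippet below applies the symmetric update to a toy state. The dense matrix `K` stands in for the convolution operator $\mathbf{K}(\boldsymbol{\theta}_{j})$, and ReLU is used for $\sigma$; both are assumptions made for illustration, as are the function name `symmetric_resnet_step` and the toy sizes.

```python
import numpy as np

def relu(x):
    # Assumed choice for the nonlinearity sigma.
    return np.maximum(x, 0.0)

def symmetric_resnet_step(Y, K, h):
    """One symmetric ResNet layer: Y - h * K^T sigma(K Y).

    Y : (n_features, n_examples) state from the previous layer
    K : (n_features, n_features) dense stand-in for the convolution operator
    h : step size
    """
    return Y - h * K.T @ relu(K @ Y)

rng = np.random.default_rng(0)
Y0 = rng.standard_normal((8, 5))   # toy state: 8 features, 5 examples
K = rng.standard_normal((8, 8))    # toy operator
Y1 = symmetric_resnet_step(Y0, K, h=0.1)

# The same operator appears twice (K and K^T), so the inner product between
# the state and the update direction equals <K Y, relu(K Y)>, which is >= 0.
alignment = np.sum(Y0 * (K.T @ relu(K @ Y0)))
```

Because $\mathbf{K}$ and $\mathbf{K}^{\top}$ share parameters, the layer needs only one set of kernels, and the output has the same shape as the input.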

The hyperspectral example requires a two-level network with one pooling operation. We employ the BLR layers, as in the fully invertible hyperbolic network. However, because the ResNet relies on automatic differentiation rather than invertibility for gradient computations, the same number of layers does not fit in GPU memory. Therefore, we have to shorten the network. Figure 6 compares three ResNets, and Table 5 describes the design details.

Comparison ResNets for the Hyperspectral example
Network	$\#$ layers level 1	$\#$ layers level 2	Result
1	3	4	Fig. 6
2	2	5	Fig. 6
3	5	2	Fig. 6
4	6	1	out-of-memory
5	3	5	out-of-memory
6	5	3	out-of-memory

Table 5: Designs of the comparison ResNets according to Eq. 1.21. These are all two-level networks, i.e., a few ResNet blocks followed by a pooling operation and some more ResNet blocks. The last column indicates whether the design appears in Figure 6 or ran out of GPU memory.
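The difference between the two gradient strategies can be illustrated with a small sketch. Automatic differentiation typically keeps every layer's state for the backward pass, whereas an invertible network recomputes earlier states from the final ones. The snippet uses a generic leapfrog (hyperbolic) update as an assumed stand-in; it is not necessarily the telegraph-equation form used for the proposed network, and the dense kernels, ReLU, function names, and toy sizes are all hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(y_prev, y_curr, kernels, h):
    """Leapfrog steps y_next = 2 y_curr - y_prev - h^2 K^T relu(K y_curr).

    Only the two most recent states are kept in memory.
    """
    for K in kernels:
        y_next = 2.0 * y_curr - y_prev - h**2 * K.T @ relu(K @ y_curr)
        y_prev, y_curr = y_curr, y_next
    return y_prev, y_curr

def recover_initial_states(y_prev, y_curr, kernels, h):
    """Run the same update backward to recompute the first two states."""
    for K in reversed(kernels):
        y_older = 2.0 * y_prev - y_curr - h**2 * K.T @ relu(K @ y_prev)
        y_prev, y_curr = y_older, y_prev
    return y_prev, y_curr

rng = np.random.default_rng(0)
n_layers, n_features = 10, 6
kernels = [0.5 * rng.standard_normal((n_features, n_features))
           for _ in range(n_layers)]
y0 = rng.standard_normal((n_features, 3))
y1 = y0.copy()  # assumed initialization: two identical starting states

yN_prev, yN = forward(y0, y1, kernels, h=0.1)
r0, r1 = recover_initial_states(yN_prev, yN, kernels, h=0.1)
```

In this sketch the recomputed states `r0` and `r1` agree with `y0` and `y1` up to floating-point round-off, so the number of states held in memory does not depend on `n_layers`. A ResNet of the form in Eq. 1.21 has no such closed-form inverse, which is why its stored states grow with depth.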

A.2 3D Seismic Interpretation

Memory-wise, it is impossible to train the seismic segmentation example on most GPUs without invertibility and BLR kernels, because the convolutional kernels then require 41.16 GB and the states alone require 21.96 GB. These numbers do not include memory for intermediate computations, the gradient itself, or other storage for the optimizer. For comparison, we construct the largest possible networks that fit on a $24$ GB GPU and are as similar as possible to the network in Table 3. We modify the design by removing one level and shortening the network. Network a uses BLR layers, while network b does not. We refer back to Table 4 for the performance metrics, which show that the networks with fewer layers and one fewer level significantly underperform the proposed network. See Figure 13 for predictions from the shorter network.

Table 6: Network designs that just fit on a $24$ GB GPU, to compare to our fully invertible network for the 3D seismic interpretation example.

network a
Layer	Block rank	Channels	Feature size
1-2	8	12	$248\times 248\times 248$
3-5	16	96	$124\times 124\times 124$
6-17	32	768	$62\times 62\times 62$
18-20	16	96	$124\times 124\times 124$
21-26	8	12	$248\times 248\times 248$
network b
1-2	12	12	$248\times 248\times 248$
3-4	96	96	$124\times 124\times 124$
5-6	768	768	$62\times 62\times 62$
7-8	96	96	$124\times 124\times 124$
9-14	12	12	$248\times 248\times 248$
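A rough estimate of the state memory implied by Table 6 can be sketched as follows. The channel counts and feature sizes are taken from the table; the storage type (float32), the convention 1 GB = $10^{9}$ bytes, and the rule of one stored state per layer are assumptions, and the helper names are hypothetical. Kernels, gradients, intermediate computations, and optimizer storage are not counted.

```python
# Levels in Table 6: (channels, edge length of the cubic feature volume).
LEVELS = {1: (12, 248), 2: (96, 124), 3: (768, 62)}
BYTES_PER_VALUE = 4  # assumption: float32 storage

def state_elements(channels, edge):
    """Number of values in one network state at a given level."""
    return channels * edge**3

def stored_state_gb(n_states, channels=12, edge=248):
    """Memory in GB (1e9 bytes) to keep n_states states of the given size."""
    return n_states * state_elements(channels, edge) * BYTES_PER_VALUE / 1e9

# Each coarsening halves every spatial dimension and multiplies the channel
# count by 8, so the number of values per state is the same at every level.
elements = {lvl: state_elements(c, n) for lvl, (c, n) in LEVELS.items()}

mem_network_a = stored_state_gb(26)  # network a has 26 layers
mem_network_b = stored_state_gb(14)  # network b has 14 layers
```

Under these assumptions, every state holds 183,035,904 values (about 0.73 GB), which gives roughly 19.0 GB of stored states for network a and 10.3 GB for network b. Thirty such states amount to about 21.96 GB, which is consistent with the state memory quoted above for training without invertibility.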