
License: arXiv.org perpetual non-exclusive license
arXiv:2309.17182v2 [cs.LG] 07 Mar 2024

RECOMBINER: Robust and Enhanced
Compression with Bayesian Implicit Neural
Representations

Jiajun He
University of Cambridge

Gergely Flamich*
University of Cambridge

Zongyu Guo
University of Science and Technology of China

José Miguel Hernández-Lobato
University of Cambridge

* equal contribution.
Abstract

COMpression with Bayesian Implicit NEural Representations (combiner) is a recent data compression method that addresses a key inefficiency of previous Implicit Neural Representation (inr)-based approaches: it avoids quantization and enables direct optimization of the rate-distortion performance. However, combiner still has significant limitations: 1) it uses factorized priors and posterior approximations that lack flexibility; 2) it cannot effectively adapt to local deviations from global patterns in the data; and 3) its performance can be susceptible to modeling choices and the variational parameters’ initializations. Our proposed method, Robust and Enhanced combiner (recombiner), addresses these issues by 1) enriching the variational approximation while retaining a low computational cost via a linear reparameterization of the inr weights, 2) augmenting our inrs with learnable positional encodings that enable them to adapt to local details and 3) splitting high-resolution data into patches to increase robustness and utilizing expressive hierarchical priors to capture dependency across patches. We conduct extensive experiments across several data modalities, showcasing that recombiner achieves competitive results with the best inr-based methods and even outperforms autoencoder-based codecs on low-resolution images at low bitrates. Our PyTorch implementation is available at https://github.com/cambridge-mlg/RECOMBINER/.

1 Introduction

Advances in deep learning recently enabled a new data compression technique impossible with classical approaches: we train a neural network to memorize the data (Stanley, 2007) and then encode the network’s weights instead. These networks are called the implicit neural representation (inr) of the data, and differ from neural networks used elsewhere in three significant ways. First, they treat data as a signal that maps from coordinates to values, such as mapping $(X, Y)$ pixel coordinates to $(R, G, B)$ color triplets in the case of an image. Second, their architecture consists of many fewer layers and units than usual and tends to utilize siren activations (Sitzmann et al., 2020). Third, we aim to overfit them to the data as much as possible.

Unfortunately, most inr-based data compression methods cannot directly and jointly optimize rate-distortion, which results in a wasteful allocation of bits leading to suboptimal coding performance. COMpression with Bayesian Implicit NEural Representations (combiner; Guo et al., 2023) addresses this issue by picking a variational Gaussian mean-field Bayesian neural network (Blundell et al., 2015) as the inr of the data. This choice enables joint rate-distortion optimization via maximizing the inr’s $\beta$-evidence lower bound ($\beta$-ELBO), where $\beta$ controls the rate-distortion trade-off. Finally, the authors encode a weight sample from the inr’s variational weight posterior to represent the data using relative entropy coding (REC; Havasi et al., 2018; Flamich et al., 2020).

Figure 1: Schematic of (a) combiner and (b) recombiner, our proposed method. See Sections 2 and 3 for notation. As the inr’s input, recombiner uses $\mathbf{h}_{\mathbf{z}}$ upsampled to pixel-wise positional encodings concatenated with Fourier embeddings. (c) A closer look at how recombiner maps $\mathbf{h}_{\mathbf{z}}$ to the inr input, taking images as an example. FE: Fourier embeddings; FC: fully connected layer.

Although combiner performs strongly among inr-based approaches, it falls short of the state-of-the-art codecs on well-established data modalities both in terms of performance and robustness. In this paper, we identify several issues that lead to this discrepancy: 1) combiner employs a fully-factorized Gaussian variational posterior over the inr weights, which tends to underfit the data (Dusenberry et al., 2020), going directly against our goal of overfitting; 2) overfitting the small inrs used by combiner is challenging, especially at low bitrates: a small change to any weight can significantly affect the reconstruction at every coordinate, hence optimization by stochastic gradient descent becomes unstable and yields suboptimal results; 3) overfitting becomes more problematic on high-resolution signals: as highlighted by Guo et al. (2023), the method is sensitive to model choices and the variational parameters’ initialization and requires considerable effort to tune.

We tackle these problems by proposing several non-trivial extensions to combiner, which significantly improve the rate-distortion performance and robustness to modeling choices. Hence, we dub our method robust and enhanced combiner (recombiner). Concretely, our contributions are:

  • We propose a simple yet effective learned reparameterization for neural network weights specifically tailored for inr-based compression, yielding more expressive variational posteriors while matching the computational cost of standard mean-field variational inference.

  • We augment our inr with learnable positional encodings whose parameters only have a local influence on the reconstructed signal, thus allowing deviations from the global patterns captured by the network weights, facilitating overfitting the inr with gradient descent.

  • We split high-resolution data into patches to improve robustness to modeling choices and the variational parameters’ initialization. Moreover, we propose an expressive hierarchical Bayesian model to capture the dependencies across patches to enhance performance.

  • We conduct extensive experiments to verify the effectiveness of our proposed extensions across several data modalities, including image, audio, video and protein structure data. In particular, we show that recombiner achieves better rate-distortion performance than VAE-based approaches on low-resolution images at low bitrates.

2 Background

This section reviews the essential parts of Guo et al. (2023)’s compression with Bayesian implicit neural representations (combiner), as it provides the basis for our method.

Variational Bayesian Implicit Neural Representations: We assume the data we wish to compress can be represented as a continuous function $f: \mathbb{R}^{\mathtt{I}} \to \mathbb{R}^{\mathtt{O}}$ from $\mathtt{I}$-dimensional coordinates to $\mathtt{O}$-dimensional signal values. Then, our goal is to approximate $f$ with a small neural network $g(\cdot \mid \mathbf{w})$ with weights $\mathbf{w}$. Given $L$ hidden layers in the network, we write $\mathbf{w} = [\mathbf{w}^{[1]}, \ldots, \mathbf{w}^{[L]}]$, which represents the concatenation of the $L$ weight matrices $\mathbf{w}^{[1]}, \ldots, \mathbf{w}^{[L]}$, each flattened into a row-vector. Guo et al. (2023) propose using variational Bayesian neural networks (BNN; Blundell et al., 2015) that place a prior $p_{\mathbf{w}}$ and a variational posterior $q_{\mathbf{w}}$ on the weights. Furthermore, they use Fourier embeddings $\gamma(\mathbf{x})$ for the input data (Tancik et al., 2020) and sine activations at the hidden layers (Sitzmann et al., 2020).
To infer the implicit neural representation (inr) for some data $\mathcal{D}$, we treat $\mathcal{D}$ as a dataset of coordinate-value pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{D}$, e.g. for an image, $\mathbf{x}_i$ can be an $(X, Y)$ pixel coordinate and $\mathbf{y}_i$ the corresponding $(R, G, B)$ triplet. Next, we pick a distortion metric $\Delta$ (e.g., mean squared error) and a trade-off parameter $\beta$ to define the $\beta$-rate-distortion objective:

$$\mathcal{L}(\mathcal{D}, q_{\mathbf{w}}, p_{\mathbf{w}}, \beta) = \beta \cdot D_{\mathrm{KL}}[q_{\mathbf{w}} \,\|\, p_{\mathbf{w}}] + \frac{1}{D} \sum_{i=1}^{D} \mathbb{E}_{q_{\mathbf{w}}}\left[\Delta(\mathbf{y}_i, g(\mathbf{x}_i \mid \mathbf{w}))\right], \tag{1}$$

where $D_{\mathrm{KL}}[q_{\mathbf{w}} \,\|\, p_{\mathbf{w}}]$ denotes the Kullback-Leibler divergence of $q_{\mathbf{w}}$ from $p_{\mathbf{w}}$, and as we explain below, it represents the compression rate of a single weight sample $\mathbf{w} \sim q_{\mathbf{w}}$. Note that Equation 1 corresponds to a negative $\beta$-evidence lower bound under mild assumptions on $\Delta$.
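As an illustration of how Equation 1 might be evaluated, the sketch below estimates it by Monte Carlo for factorized Gaussian $q_{\mathbf{w}}$ and $p_{\mathbf{w}}$, with squared error as $\Delta$. The function names, the single-sine-unit `toy_inr` standing in for $g$, the sample count and the seed are all assumptions made for this example and are not taken from the paper's implementation.

```python
import math
import random


def kl_factorized_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL[q || p] in nats for factorized Gaussians (lists of means and std devs)."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def toy_inr(x, w):
    """Hypothetical stand-in for g(x | w): one sine unit, not the paper's network."""
    return w[1] * math.sin(w[0] * x)


def beta_rd_objective(data, mu_q, sigma_q, mu_p, sigma_p, beta, n_samples=64, seed=0):
    """Monte Carlo estimate of Equation 1 with squared error as the distortion."""
    rng = random.Random(seed)
    rate = kl_factorized_gaussians(mu_q, sigma_q, mu_p, sigma_p)
    distortion = 0.0
    for _ in range(n_samples):
        # Reparameterized weight sample: w = mu + sigma * eps, eps ~ N(0, 1).
        w = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_q, sigma_q)]
        distortion += sum((y - toy_inr(x, w)) ** 2 for x, y in data) / len(data)
    distortion /= n_samples
    return beta * rate + distortion, rate, distortion


# Coordinate-value pairs (x_i, y_i) sampled from a toy 1-D signal.
data = [(i / 8.0, 0.5 * math.sin(2.0 * i / 8.0)) for i in range(8)]
loss, rate, distortion = beta_rd_objective(
    data,
    mu_q=[2.0, 0.5], sigma_q=[0.05, 0.05],
    mu_p=[0.0, 0.0], sigma_p=[1.0, 1.0],
    beta=0.01,
)
```

Because both distributions are factorized, the rate term is a sum of per-weight KL divergences and needs no sampling; only the distortion term is estimated stochastically.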

We infer the optimal posterior by computing $q^*_{\mathbf{w}} = \operatorname*{arg\,min}_{q_{\mathbf{w}} \in \mathcal{Q}} \mathcal{L}(\mathcal{D}, q_{\mathbf{w}}, p_{\mathbf{w}}, \beta)$ over an appropriate variational family $\mathcal{Q}$. Guo et al. (2023) set $\mathcal{Q}$ to be the family of factorized Gaussian distributions.

Training combiner: Once we have selected a network architecture $g$ for our inrs, a crucial element of combiner is to select a good prior on the weights $p_{\mathbf{w}}$. Given a training set $\{\mathcal{D}_1, \ldots, \mathcal{D}_M\}$ and an initial guess for $p_{\mathbf{w}}$, Guo et al. (2023) propose the following iterative scheme to select the optimal prior: 1) Fix $p_{\mathbf{w}}$ and infer the variational inr posteriors $q^*_{\mathbf{w},m}$ for each datum $\mathcal{D}_m$ by minimizing Equation 1; 2) Fix the $q^*_{\mathbf{w},m}$s and update the parameters of the prior $p_{\mathbf{w}}$ based on the parameters of the posteriors. When the $q_{\mathbf{w}}$ are Gaussian, Guo et al. (2023) derive analytic formulae for updating the prior parameters. To avoid overloading the notion of training, we refer to learning $p_{\mathbf{w}}$ and the other model parameters as training, and to learning $q_{\mathbf{w}}$ as inferring the inr.
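Step 2 of the scheme can be sketched for factorized Gaussians as follows. The text does not reproduce the analytic formulae of Guo et al. (2023), so this example assumes the update is the closed-form minimizer of the average KL divergence $\frac{1}{M}\sum_m D_{\mathrm{KL}}[q^*_{\mathbf{w},m} \| p_{\mathbf{w}}]$ over the prior's means and standard deviations; the function names are hypothetical, and step 1 (inferring the posteriors by gradient descent) is omitted.

```python
import math


def kl_factorized_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL[q || p] in nats for factorized Gaussians (lists of means and std devs)."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )


def update_prior(post_mus, post_sigmas):
    """Assumed prior update: minimize the sum over m of KL[q_m || p] in closed form.

    Per dimension, the prior mean is the average posterior mean and the prior
    variance is the average of posterior variance plus squared mean offset.
    """
    n_data = len(post_mus)
    dim = len(post_mus[0])
    mu_p, sigma_p = [], []
    for d in range(dim):
        mean = sum(mu[d] for mu in post_mus) / n_data
        var = sum(
            sig[d] ** 2 + (mu[d] - mean) ** 2
            for mu, sig in zip(post_mus, post_sigmas)
        ) / n_data
        mu_p.append(mean)
        sigma_p.append(math.sqrt(var))
    return mu_p, sigma_p


def average_kl(post_mus, post_sigmas, mu_p, sigma_p):
    """Average rate over the training set for a given prior."""
    kls = [
        kl_factorized_gaussians(mu, sig, mu_p, sigma_p)
        for mu, sig in zip(post_mus, post_sigmas)
    ]
    return sum(kls) / len(kls)


# Two toy posteriors (M = 2) over a single weight.
post_mus = [[0.0], [2.0]]
post_sigmas = [[1.0], [1.0]]
mu_p, sigma_p = update_prior(post_mus, post_sigmas)
```

In this toy case the updated prior lowers the average KL relative to a standard normal initial guess, which is the behaviour one would expect each training iteration to exhibit.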

Compressing data with combiner: Once we have picked the inr architecture $g$ and found the optimal prior $p_{\mathbf{w}}$, we can use combiner to compress new data $\mathcal{D}$ in two steps: 1) We first infer the variational inr posterior $q_{\mathbf{w}}$ for $\mathcal{D}$ by optimizing Equation 1, after which 2) we encode an approximate sample from $q_{\mathbf{w}}$ using relative entropy coding (REC), whose expected coding cost is approximately $D_{\mathrm{KL}}[q_{\mathbf{w}} \,\|\, p_{\mathbf{w}}]$ (Havasi et al., 2018; Flamich et al., 2020). Following Guo et al. (2023), we used depth-limited global-bound A$^*$ coding (Flamich et al., 2022), to which we will refer as just A$^*$ coding. Unfortunately, applying A$^*$ coding to encode a sample from $q_{\mathbf{w}}$ is infeasible in practice, as the time complexity of the algorithm grows as $\Omega(\exp(D_{\mathrm{KL}}[q_{\mathbf{w}} \,\|\, p_{\mathbf{w}}]))$. Hence, Guo et al. (2023) suggest breaking up the problem into smaller ones.
First, they draw a uniformly random permutation $\alpha$ on $\dim(\mathbf{w})$ elements, and use it to permute the dimensions of $\mathbf{w}$ as $\alpha(\mathbf{w}) = [\mathbf{w}_{\alpha(1)}, \ldots, \mathbf{w}_{\alpha(\dim(\mathbf{w}))}]$. Then, they partition $\alpha(\mathbf{w})$ into smaller blocks, and compress the blocks sequentially. Permuting the weight vector ensures that the KL divergences are spread approximately evenly across the blocks. As an additional technical note, after compressing each block, we run a few steps of fine-tuning on the posterior of the weights that are yet to be compressed; see Guo et al. (2023) for more details.
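The permute-then-partition step can be sketched as below. The fixed block size and the seeded generator (so that a decoder could regenerate the same permutation) are assumptions made for this example; the text only states that the permuted weights are partitioned into smaller blocks. The per-weight KL values are toy numbers.

```python
import random


def permute_and_partition(n_weights, block_size, seed):
    """Draw a seeded uniformly random permutation and cut it into consecutive blocks."""
    rng = random.Random(seed)
    perm = list(range(n_weights))
    rng.shuffle(perm)
    blocks = [perm[i:i + block_size] for i in range(0, n_weights, block_size)]
    return perm, blocks


def block_kls(kl_per_weight, blocks):
    """KL of each block; additive over weights because the posterior is factorized."""
    return [sum(kl_per_weight[j] for j in block) for block in blocks]


# Toy per-weight KLs: the first quarter of the weights carries most of the rate.
kl_per_weight = [2.0] * 16 + [0.1] * 48

contiguous = [list(range(i, i + 16)) for i in range(0, 64, 16)]
perm, permuted = permute_and_partition(len(kl_per_weight), block_size=16, seed=0)

kl_contiguous = block_kls(kl_per_weight, contiguous)
kl_permuted = block_kls(kl_per_weight, permuted)
print("contiguous blocks:", kl_contiguous)
print("permuted blocks:  ", kl_permuted)
```

Without the permutation, the first block would carry almost all of the KL divergence, and since the coding time grows exponentially in a block's KL, that single block would dominate the cost. After shuffling, the same total is shared more evenly.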

3 Methods

In this section, we propose several extensions to Guo et al. (2023)’s framework that significantly improve its robustness and performance: 1) we introduce a linear reparameterization for the inr’s weights which yields a richer variational posterior family; 2) we augment the inr’s input with learned positional encodings to capture local features in the data and to assist overfitting; 3) we scale our method to high-resolution image compression by dividing the images into patches and introducing an expressive hierarchical Bayesian model over the patch-inrs, and 4) we introduce minor modifications to the training procedure and adaptively select $\beta$ to achieve the desired coding budget. Contributions 1) and 2) are depicted in Figure 1, while 3) is shown in Figure 2.

3.1 Linear Reparameterization for the Network Parameters

A significant limitation of the factorized Gaussian variational posterior used by combiner is that it posits dimension-wise independent weights. This assumption is known to be unrealistic (Izmailov et al., 2021) and to underfit the data (Dusenberry et al., 2020), which goes directly against our goal of overfitting the data. On the other hand, using a full-covariance Gaussian posterior approximation would increase the inr’s training and coding time significantly, even for small network architectures.

Hence, we propose a solution that lies in-between: at a high level, we learn a linearly-transformed factorized Gaussian approximation that closely matches the full-covariance Gaussian posterior on average over the training data. Formally, for each layer $l = 1, \ldots, L$, we model the weights as $\mathbf{w}^{[l]} = \mathbf{h}_{\mathbf{w}}^{[l]} \bm{A}^{[l]}$, where the $\bm{A}^{[l]}$ are square matrices, and we place a factorized Gaussian prior and variational posterior on $\mathbf{h}_{\mathbf{w}}^{[l]}$ instead. We learn each $\bm{A}^{[l]}$ during the training stage, after which we fix them and only infer factorized posteriors $q_{\mathbf{h}_{\mathbf{w}}^{[l]}}$ when compressing new data.
To simplify notation, we collect the $\bm{A}^{[l]}$ in a block-diagonal matrix $\bm{A} = \mathrm{diag}(\bm{A}^{[1]}, \ldots, \bm{A}^{[L]})$ and the $\mathbf{h}^{[l]}_{\mathbf{w}}$ in a single row-vector $\mathbf{h}_{\mathbf{w}} = [\mathbf{h}^{[1]}_{\mathbf{w}}, \ldots, \mathbf{h}^{[L]}_{\mathbf{w}}]$, so that now the weights are given by $\mathbf{w} = \mathbf{h}_{\mathbf{w}} \bm{A}$. We found this layer-wise weight reparameterization to be as effective as using a joint one for the entire weight vector $\mathbf{w}$. Hence, we use the layer-wise approach, as it is more parameter and compute-efficient.
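The layer-wise map $\mathbf{w}^{[l]} = \mathbf{h}_{\mathbf{w}}^{[l]} \bm{A}^{[l]}$ and its block-diagonal form can be checked with a small pure-Python sketch. The helper names and the toy layer sizes (2 and 3 weights) are assumptions for illustration only; real layers are far larger.

```python
def row_times_matrix(h, A):
    """Row-vector times matrix: returns h A as a list."""
    return [sum(h[i] * A[i][j] for i in range(len(h))) for j in range(len(A[0]))]


def reparameterize(h_layers, A_layers):
    """Apply w^[l] = h^[l] A^[l] per layer and concatenate into one row-vector w."""
    w = []
    for h, A in zip(h_layers, A_layers):
        w.extend(row_times_matrix(h, A))
    return w


def block_diag(A_layers):
    """Assemble A = diag(A^[1], ..., A^[L]) as a dense nested list."""
    n = sum(len(A) for A in A_layers)
    out = [[0.0] * n for _ in range(n)]
    offset = 0
    for A in A_layers:
        for i, row in enumerate(A):
            for j, value in enumerate(row):
                out[offset + i][offset + j] = value
        offset += len(A)
    return out


# Two toy "layers" with 2 and 3 flattened weights respectively.
h_layers = [[1.0, -2.0], [0.5, 0.0, 3.0]]
A_layers = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]],
]

w_layerwise = reparameterize(h_layers, A_layers)
w_joint = row_times_matrix(h_layers[0] + h_layers[1], block_diag(A_layers))

# Learnable entries: sum of squared layer sizes versus the square of the total size.
layerwise_params = sum(len(A) ** 2 for A in A_layers)
joint_params = sum(len(A) for A in A_layers) ** 2
```

Both routes give the same weight vector, while the layer-wise form stores 13 matrix entries instead of 25 in this toy case; the gap widens quickly as the layers grow.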

This simple yet expressive variational approximation has a couple of advantages. First, it provides an expressive full-covariance prior and posterior while requiring much less training and coding time. Specifically, the KL divergence required by Equation 1 is still between factorized Gaussians and we do not need to optimize the full covariance matrices of the posteriors during coding. Second, this parameterization has scale redundancy: for any nonzero $c \in \mathbb{R}$ we have $\mathbf{h}_{\mathbf{w}} \bm{A} = (\tfrac{1}{c} \cdot \mathbf{h}_{\mathbf{w}})(c \cdot \bm{A})$. Hence, if we initialize $\mathbf{h}_{\mathbf{w}}$ suboptimally during training, $\bm{A}$ can still learn to compensate for it, making our method more robust. Finally, note that this reparameterization is specifically tailored for inr-based compression and would usually not be feasible in other BNN use-cases, since we learn $\bm{A}$ while inferring multiple variational posteriors simultaneously.
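The first advantage can be spelled out with a short derivation. The symbols $\bm{m}$ and $\bm{S}$ are notation introduced only for this sketch, and the final identity additionally assumes that $\bm{A}$ is invertible, which the text does not state explicitly.

```latex
% Assumption: h_w is a row-vector with a factorized Gaussian law, S diagonal.
\mathbf{h}_{\mathbf{w}} \sim \mathcal{N}(\bm{m}, \bm{S})
\quad\Longrightarrow\quad
\mathbf{w} = \mathbf{h}_{\mathbf{w}} \bm{A}
\sim \mathcal{N}\!\left(\bm{m}\bm{A},\; \bm{A}^{\top} \bm{S} \bm{A}\right).

% A^T S A is in general not diagonal, so w has correlated entries
% within each layer, while A's block-diagonal structure keeps layers uncorrelated.

% If A is invertible, h -> hA is a bijection and the KL divergence is unchanged:
D_{\mathrm{KL}}[q_{\mathbf{w}} \,\|\, p_{\mathbf{w}}]
= D_{\mathrm{KL}}[q_{\mathbf{h}_{\mathbf{w}}} \,\|\, p_{\mathbf{h}_{\mathbf{w}}}].
```

Under these assumptions the rate term can be computed in the $\mathbf{h}_{\mathbf{w}}$ space, where both distributions are factorized, even though the induced distributions over $\mathbf{w}$ are not.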

3.2 Learned Positional Encodings

A challenge for overfitting inrs, especially at low bitrates, is their global representation of the data, in the sense that each of their weights influences the reconstruction at every coordinate. To mitigate this issue, we extend our inrs to take a learned positional input $\mathbf{z}_i$ at each coordinate $\mathbf{x}_i$: $g(\mathbf{x}_i, \mathbf{z}_i \mid \mathbf{w})$.

However, it is usually wasteful to introduce a vector for each coordinate in practice. Instead, we use a lower-dimensional row-vector representation $\mathbf{h}_{\mathbf{z}}$, that we reshape and upsample with a learnable function $\phi$. In the case of a $W \times H$ image with $F$-dimensional positional encodings, we could pick $\mathbf{h}_{\mathbf{z}}$ such that $\dim(\mathbf{h}_{\mathbf{z}}) \ll F \cdot W \cdot H$, then reshape and upsample it to be $F \times W \times H$ by picking $\phi$ to be some small convolutional network. Then, we set $\mathbf{z}_i = \phi(\mathbf{h}_{\mathbf{z}})_{\mathbf{x}_i}$ to be the positional encoding at location $\mathbf{x}_i$. We placed a factorized Gaussian prior and variational posterior on $\mathbf{h}_{\mathbf{z}}$. Hereafter, we refer to $\mathbf{h}_{\mathbf{z}}$ as the latent positional encodings, $\phi(\mathbf{h}_{\mathbf{z}})$ and $\mathbf{z}_i$ as the upsampled positional encodings.
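The reshape, upsample and look-up pipeline can be illustrated with the sketch below. Nearest-neighbour upsampling is used here purely as a parameter-free stand-in for the learnable convolutional $\phi$; the function names, the $2 \times 2$ latent grid and the $8 \times 8$ image size are assumptions for this example.

```python
def reshape_latent(h_z, n_feat, w_lat, h_lat):
    """Reshape the flat row-vector h_z into nested lists indexed [feature][x][y]."""
    assert len(h_z) == n_feat * w_lat * h_lat
    values = iter(h_z)
    return [
        [[next(values) for _ in range(h_lat)] for _ in range(w_lat)]
        for _ in range(n_feat)
    ]


def upsample_nearest(grid, width, height):
    """Stand-in for phi: nearest-neighbour upsampling to n_feat x width x height."""
    w_lat, h_lat = len(grid[0]), len(grid[0][0])
    return [
        [
            [channel[x * w_lat // width][y * h_lat // height] for y in range(height)]
            for x in range(width)
        ]
        for channel in grid
    ]


def positional_encoding_at(upsampled, x, y):
    """z_i = phi(h_z) evaluated at pixel (x, y): one value per feature channel."""
    return [channel[x][y] for channel in upsampled]


n_feat, width, height = 2, 8, 8
# 8 latent values instead of n_feat * width * height = 128 per-pixel values.
h_z = [float(k) for k in range(n_feat * 2 * 2)]
upsampled = upsample_nearest(reshape_latent(h_z, n_feat, 2, 2), width, height)
z = positional_encoding_at(upsampled, x=5, y=1)
```

With this stand-in, each latent entry influences only one $4 \times 4$ region of a single feature channel, which illustrates the local influence that distinguishes the positional encodings from the network weights. A learned convolutional $\phi$ would spread that influence over a somewhat larger neighbourhood.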

3.3 Scaling To High-Resolution Data with Patches

With considerable effort, Guo et al. (2023) successfully scaled combiner to high-resolution images by significantly increasing the number of inr parameters. However, they note that the training procedure was very sensitive to hyperparameters, including the initialization of variational parameters and model size selection. Unfortunately, improving the robustness of large inrs using the weight reparameterization we describe in Section 3.1 is also impractical, because the size of the transformation matrix $\bm{A}$ grows quadratically in the number of weights. Therefore, we split high-resolution data into patches and infer a separate small inr for each patch, in line with other inr-based works as well (Dupont et al., 2022; Schwarz & Teh, 2022; Schwarz et al., 2023). However, the patches’ inrs are independent by default, hence we re-introduce information sharing between the patch-inrs’ weights via a hierarchical model for $\mathbf{h}_{\mathbf{w}}$. Finally, we take advantage of the patch structure to parallelize data compression and reduce the encoding time in recombiner, as discussed at the end of this section.

recombiner’s hierarchical Bayesian model: We posit a global representation for the weights $\overline{\mathbf{h}}_{\mathbf{w}}$, from which each patch-inr can deviate. Thus, assuming that the data $\mathcal{D}$ is split into $P$ patches, for each patch $\pi \in \{1, \ldots, P\}$, we need to define the conditional distributions of patch representations $\mathbf{h}_{\mathbf{w}}^{(\pi)} \mid \overline{\mathbf{h}}_{\mathbf{w}}$. However, since we wish to model deviations from the global representation, it is natural to decompose the patch representation as $\mathbf{h}_{\mathbf{w}}^{(\pi)} = \Delta\mathbf{h}_{\mathbf{w}}^{(\pi)} + \overline{\mathbf{h}}_{\mathbf{w}}$, and specify the conditional distribution of the differences $\Delta\mathbf{h}_{\mathbf{w}}^{(\pi)} \mid \overline{\mathbf{h}}_{\mathbf{w}}$ instead, without any loss of generality.
In this paper, we place a factorized Gaussian prior and variational posterior on the joint distribution of the global representation and the deviations, given by the following product of $P + 1$ Gaussian measures:

$$p_{\,\overline{\mathbf{h}}_{\mathbf{w}}, \Delta\mathbf{h}_{\mathbf{w}}^{(1:P)}} = \mathcal{N}(\overline{\bm{\mu}}_{\mathbf{w}}, \mathrm{diag}(\overline{\bm{\sigma}}_{\mathbf{w}})) \times \prod_{\pi=1}^{P} \mathcal{N}(\bm{\mu}_{\Delta}^{(\pi)}, \mathrm{diag}(\bm{\sigma}_{\Delta}^{(\pi)})) \tag{2}$$
$$q_{\,\overline{\mathbf{h}}_{\mathbf{w}}, \Delta\mathbf{h}_{\mathbf{w}}^{(1:P)}} = \mathcal{N}(\overline{\bm{\nu}}_{\mathbf{w}}, \mathrm{diag}(\overline{\bm{\rho}}_{\mathbf{w}})) \times \prod_{\pi=1}^{P} \mathcal{N}(\bm{\nu}_{\Delta}^{(\pi)}, \mathrm{diag}(\bm{\rho}_{\Delta}^{(\pi)})), \tag{3}$$
Figure 2: Illustration of (a) the three-level hierarchical model and (b) our permutation strategy.

where $1:P$ is the slice notation, i.e. $\Delta\mathbf{h}_{\mathbf{w}}^{(1:P)} = \Delta\mathbf{h}_{\mathbf{w}}^{(1)}, \ldots, \Delta\mathbf{h}_{\mathbf{w}}^{(P)}$. Importantly, while the posterior approximation in Equation 3 assumes that the global representation and the differences are independent, $\overline{\mathbf{h}}_{\mathbf{w}}$ and $\mathbf{h}_{\mathbf{w}}^{(\pi)}$ remain correlated. Note that optimizing Equation 1 requires us to compute $D_{\mathrm{KL}}[q_{\mathbf{h}_{\mathbf{w}}^{(1:P)}} \,\|\, p_{\mathbf{h}_{\mathbf{w}}^{(1:P)}}]$. Unfortunately, due to the complex dependence between the $\mathbf{h}_{\mathbf{w}}^{(\pi)}$s, this calculation is infeasible.
Instead, we can minimize an upper bound to it by observing that

\begin{align}
D_{\mathrm{KL}}[q_{\mathbf{h}_{\mathbf{w}}^{(1:P)}} \,\|\, p_{\mathbf{h}_{\mathbf{w}}^{(1:P)}}]
&\leq D_{\mathrm{KL}}[q_{\mathbf{h}_{\mathbf{w}}^{(1:P)}} \,\|\, p_{\mathbf{h}_{\mathbf{w}}^{(1:P)}}] + D_{\mathrm{KL}}[q_{\,\overline{\mathbf{h}}_{\mathbf{w}} \mid \mathbf{h}_{\mathbf{w}}^{(1:P)}} \,\|\, p_{\,\overline{\mathbf{h}}_{\mathbf{w}} \mid \mathbf{h}_{\mathbf{w}}^{(1:P)}}] \nonumber \\
&= D_{\mathrm{KL}}[q_{\,\overline{\mathbf{h}}_{\mathbf{w}}, \mathbf{h}_{\mathbf{w}}^{(1:P)}} \,\|\, p_{\,\overline{\mathbf{h}}_{\mathbf{w}}, \mathbf{h}_{\mathbf{w}}^{(1:P)}}] \nonumber \\
&= D_{\mathrm{KL}}[q_{\,\overline{\mathbf{h}}_{\mathbf{w}}, \Delta\mathbf{h}_{\mathbf{w}}^{(1:P)}} \,\|\, p_{\,\overline{\mathbf{h}}_{\mathbf{w}}, \Delta\mathbf{h}_{\mathbf{w}}^{(1:P)}}]. \tag{4}
\end{align}

Hence, when training the patch-inrs, we replace the KL term in Equation 1 with the divergence in Equation 4, which is between factorized Gaussian distributions and cheap to compute. Finally, we remark that we can view $\overline{\mathbf{h}}_{\mathbf{w}}$ as side information also prevalent in other neural compression codecs (Ballé et al., 2018), or auxiliary latent variables enabling factorization (Koller & Friedman, 2009).
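To make the "cheap to compute" remark concrete, the following minimal sketch evaluates the divergence in Equation 4 as a sum of closed-form KL terms between diagonal Gaussians. It assumes, following the convention stated in Section 3.4, that the $\bm{\sigma}$ and $\bm{\rho}$ vectors hold variances; the function names and the toy numbers are illustrative and not taken from the authors' implementation.

```python
import math


def kl_diag_gaussians(nu, rho, mu, sigma):
    """KL[N(nu, diag(rho)) || N(mu, diag(sigma))] in nats.

    rho and sigma are variance vectors (diagonals of the covariances).
    """
    return sum(
        0.5 * (math.log(s / r) + (r + (n - m) ** 2) / s - 1.0)
        for n, r, m, s in zip(nu, rho, mu, sigma)
    )


def hierarchical_kl(q_global, p_global, q_deltas, p_deltas):
    """Factorized divergence of Equation 4 (illustrative helper).

    q_global, p_global: (mean, variance) pairs for the global representation.
    q_deltas, p_deltas: lists with one (mean, variance) pair per patch.
    """
    total = kl_diag_gaussians(*q_global, *p_global)
    for q_d, p_d in zip(q_deltas, p_deltas):
        total += kl_diag_gaussians(*q_d, *p_d)
    return total


# Toy example: the global posterior equals the prior, one patch delta is shifted.
q_global = ([0.0, 0.0], [1.0, 1.0])
p_global = ([0.0, 0.0], [1.0, 1.0])
q_deltas = [([1.0], [1.0])]
p_deltas = [([0.0], [1.0])]

kl_nats = hierarchical_kl(q_global, p_global, q_deltas, p_deltas)
kl_bits = kl_nats / math.log(2)  # convert nats to bits
```

Because every factor is a univariate Gaussian, the cost is linear in the total number of representation entries.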

While Equations 2 and 3 describe a two-level hierarchical model, we can easily extend the hierarchical structure by breaking up patches further into sub-patches and adding extra levels to the probabilistic model. For our experiments on high-resolution audio, images, and video, we found that a three-level hierarchical model worked best, with global weight representation $\overline{\overline{\mathbf{h}}}_{\mathbf{w}}$, second/group-level representations $\overline{\mathbf{h}}_{\mathbf{w}}^{(1:G)}$ and third/patch-level representations $\mathbf{h}_{\mathbf{w}}^{(1:P)}$, illustrated in Figure 2a. Empirically, a hierarchical model for $\mathbf{h}_{\mathbf{z}}$ did not yield significant gains, thus we only use it for $\mathbf{h}_{\mathbf{w}}$.
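As an illustration of the three-level structure, the sketch below composes patch-level representations from a global vector, group-level differences and patch-level differences. It assumes that each level is obtained by adding a difference vector to its parent, in the spirit of the $\Delta\mathbf{h}_{\mathbf{w}}$ differences in Equations 2 and 3; the `group_of_patch` assignment and all names are hypothetical.

```python
def compose_three_level(h_global, group_deltas, patch_deltas, group_of_patch):
    """Compose patch-level vectors as global + group difference + patch difference.

    h_global:       list of floats (top-level representation).
    group_deltas:   one difference vector per group.
    patch_deltas:   one difference vector per patch.
    group_of_patch: group_of_patch[p] is the index of the group containing patch p
                    (a hypothetical assignment used only for this sketch).
    """
    # Group-level representations: global vector plus the group's difference.
    h_groups = [
        [g + d for g, d in zip(h_global, delta)] for delta in group_deltas
    ]
    # Patch-level representations: parent group vector plus the patch's difference.
    return [
        [g + d for g, d in zip(h_groups[group_of_patch[p]], delta)]
        for p, delta in enumerate(patch_deltas)
    ]


h_patches = compose_three_level(
    h_global=[1.0, 2.0],
    group_deltas=[[0.5, 0.0], [-0.5, 1.0]],
    patch_deltas=[[0.25, 0.0], [0.0, 0.25], [0.0, 0.0], [1.0, -1.0]],
    group_of_patch=[0, 0, 1, 1],
)
```

Patches in the same group share the group-level offset, so only their individual differences need to carry patch-specific information.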

Compressing high-resolution data with recombiner: An advantage of patching is that we can compress and fine-tune inrs and latent positional encodings of all patches in parallel. Unfortunately, compressing $P$ patches in parallel using combiner’s procedure is suboptimal, since the information content between patches might vary significantly. However, by carefully permuting the weights across the patches’ representations we can 1) adaptively allocate bits to each patch to compensate for the differences in their information content and 2) enforce the same coding budget across each parallel thread to ensure consistent coding times. Concretely, we stack representations of each patch in a matrix at each level of the hierarchical model. For example, in our three-level model we set

\begin{align}
\bm{H}^{(0)}_{\pi,:} = [\mathbf{h}_{\mathbf{w}}^{(\pi)}, \mathbf{h}_{\mathbf{z}}^{(\pi)}], \quad \bm{H}^{(1)}_{g,:} = \overline{\mathbf{h}}_{\mathbf{w}}^{(g)}, \quad \bm{H}^{(2)} = \overline{\overline{\mathbf{h}}}_{\mathbf{w}}, \tag{5}
\end{align}

where we use slice notation to denote the $i$th row as $\bm{H}_{i,:}$ and the $j$th column as $\bm{H}_{:,j}$. Furthermore, let $S_n$ denote the set of permutations on $n$ elements. Now, at each level $\ell$, assume $\bm{H}^{(\ell)}$ has $\mathcal{C}_\ell$ columns and $\mathcal{R}_\ell$ rows. We sample a single within-row permutation $\kappa$ uniformly from $S_{\mathcal{C}_\ell}$ and for each column of $\bm{H}^{(\ell)}$ we sample an across-rows permutation $\alpha_j$ uniformly from $S_{\mathcal{R}_\ell}$. Then, we permute $\bm{H}^{(\ell)}$ as $\widetilde{\bm{H}}^{(\ell)}_{i,j} = \bm{H}^{(\ell)}_{\alpha_j(i),\kappa(j)}$. Finally, we split the $\bm{H}^{(\ell)}$s into blocks row-wise, and encode and fine-tune each row in parallel. We illustrate the above procedure in Figure 2b.
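The permutation $\widetilde{\bm{H}}^{(\ell)}_{i,j} = \bm{H}^{(\ell)}_{\alpha_j(i),\kappa(j)}$ and its inverse can be sketched as follows. The sketch assumes that encoder and decoder derive $\kappa$ and the $\alpha_j$ from a shared random seed, which is an assumption made here for illustration; the function names are likewise hypothetical.

```python
import random


def sample_permutations(n_rows, n_cols, seed):
    """Sample one within-row permutation kappa and one across-rows permutation per column."""
    rng = random.Random(seed)  # shared seed: an assumption of this sketch
    kappa = list(range(n_cols))
    rng.shuffle(kappa)
    alphas = []
    for _ in range(n_cols):
        alpha_j = list(range(n_rows))
        rng.shuffle(alpha_j)
        alphas.append(alpha_j)
    return kappa, alphas


def permute(H, kappa, alphas):
    """Return H_tilde with H_tilde[i][j] = H[alphas[j][i]][kappa[j]]."""
    n_rows, n_cols = len(H), len(H[0])
    return [
        [H[alphas[j][i]][kappa[j]] for j in range(n_cols)] for i in range(n_rows)
    ]


def unpermute(H_tilde, kappa, alphas):
    """Invert permute: write each entry of H_tilde back to its original position."""
    n_rows, n_cols = len(H_tilde), len(H_tilde[0])
    H = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            H[alphas[j][i]][kappa[j]] = H_tilde[i][j]
    return H


# Three patches (rows) with four representation entries each (columns).
H = [[10 * r + c for c in range(4)] for r in range(3)]
kappa, alphas = sample_permutations(n_rows=3, n_cols=4, seed=0)
H_tilde = permute(H, kappa, alphas)
H_restored = unpermute(H_tilde, kappa, alphas)
```

Since each column is shuffled across rows independently, every row of the permuted matrix mixes entries from different patches, which is what allows a fixed per-row budget to spread bits unevenly over the original patches.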

3.4 Extended Training Procedure

In this section, we describe the ways in which recombiner’s training procedure deviates from combiner’s. To begin, we collect the recombiner’s representations into one vector. For non-patching cases we set $\mathbf{h} = [\mathbf{h}_{\mathbf{w}}, \mathbf{h}_{\mathbf{z}}]$, and for the patch case using the three-level hierarchical model we set $\mathbf{h} = \mathrm{vec}([\bm{H}^{(0)}, \bm{H}^{(1)}, \bm{H}^{(2)}])$. For simplicity, we denote the factorized Gaussian prior and variational posterior over $\mathbf{h}$ as $p_{\mathbf{h}} = \mathcal{N}(\bm{\mu}, \mathrm{diag}(\bm{\sigma}))$ and $q_{\mathbf{h}} = \mathcal{N}(\bm{\nu}, \mathrm{diag}(\bm{\rho}))$, where $\bm{\mu}$ and $\bm{\nu}$ are the means and $\bm{\sigma}$ and $\bm{\rho}$ are the diagonals of covariances of the prior and the posterior, respectively.

Training recombiner: Our objective for the training stage is to obtain the model parameters $\bm{A}, \phi, \bm{\mu}, \bm{\sigma}$ given a training dataset $\{\mathcal{D}_1, \ldots, \mathcal{D}_M\}$ and a coding budget $C$. (Footnote 1: As a slight abuse of notation, we use $\phi$ to denote both the upsampling function and its parameters.) In their work, Guo et al. (2023) control the coding budget implicitly by manually setting different values for $\beta$ in Equation 1. In this paper, we adopt an explicit approach and tune $\beta$ dynamically based on our desired coding budget of $C$ bits. More precisely, after every iteration, we calculate the average KL divergence of the training examples, i.e., $\bar{\delta} = \frac{1}{M}\sum_{m=1}^{M} D_{\mathrm{KL}}[q_{\mathbf{h},m} \,\|\, p_{\mathbf{h}}]$. If $\bar{\delta} > C$, we update $\beta$ by $\beta \leftarrow \beta \times (1 + \tau_C)$; if $\bar{\delta} < C - \epsilon_C$, we update $\beta$ by $\beta \leftarrow \beta / (1 + \tau_C)$. Here $\epsilon_C$ is a threshold parameter to stabilize the training process and prevent overly frequent updates to $\beta$, and $\tau_C$ is the adjustment step size. Unless otherwise stated, we set $\tau_C = 0.5$ in our experiments. Empirically, we find the value of $\beta$ stabilizes after 30 to 50 iterations. We present the pseudocode of this prior learning algorithm in Algorithm 1. Then, our training step is a three-step coordinate descent process analogous to Guo et al. (2023)’s:

  1. Optimize variational parameters, linear transformation and upsampling network: Fix the prior $p_{\mathbf{h}}$, and optimize Equation 1 or its modified version from Section 3.3 via gradient descent. Note that $\mathcal{L}$ is a function of the linear transform $\bm{A}$ and upsampling network parameters $\phi$ too:

    \begin{align}
    \{\bm{\nu}_m, \bm{\rho}_m\}_{m=1}^{M}, \bm{A}, \phi \leftarrow \operatorname*{arg\,min}_{\{\bm{\nu}_m, \bm{\rho}_m\}_{m=1}^{M}, \bm{A}, \phi} \left\{ \frac{1}{M} \sum_{m=1}^{M} \mathcal{L}(\mathcal{D}_m, q_{\mathbf{h},m}, p_{\mathbf{h}}, \bm{A}, \phi, \beta) \right\}. \tag{6}
    \end{align}

  2. Update prior: Update the prior parameters by the closed-form solution:

    \begin{align}
    \bm{\mu} \leftarrow \frac{1}{M} \sum_{m=1}^{M} \bm{\nu}_m, \quad \bm{\sigma} \leftarrow \frac{1}{M} \sum_{m=1}^{M} \left[ (\bm{\nu}_m - \bm{\mu})^2 + \bm{\rho}_m \right]. \tag{7}
    \end{align}

  3. Update $\beta$: Set $\beta \leftarrow \beta \times (1 + \tau_C)$ or $\beta \leftarrow \beta / (1 + \tau_C)$ based on the procedure described above.
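Steps 2 and 3 of the procedure above admit a short sketch. It implements the closed-form prior update of Equation 7 and the multiplicative $\beta$ adjustment, treating $\bm{\sigma}$ and $\bm{\rho}$ as variance vectors. The function names and the value used for $\epsilon_C$ are hypothetical, and the gradient-based step 1 is omitted.

```python
def update_prior(nus, rhos):
    """Closed-form prior update (Equation 7).

    nus[m], rhos[m]: posterior mean and variance vectors of training example m.
    Returns the new prior mean and variance vectors.
    """
    M = len(nus)
    dim = len(nus[0])
    mu = [sum(nu[d] for nu in nus) / M for d in range(dim)]
    sigma = [
        sum((nu[d] - mu[d]) ** 2 + rho[d] for nu, rho in zip(nus, rhos)) / M
        for d in range(dim)
    ]
    return mu, sigma


def update_beta(beta, avg_kl, budget, eps_c, tau_c=0.5):
    """Adjust the rate penalty so the average KL tracks the coding budget.

    avg_kl is the average KL over the training examples, in the same unit as budget.
    """
    if avg_kl > budget:
        return beta * (1.0 + tau_c)  # over budget: penalize rate more strongly
    if avg_kl < budget - eps_c:
        return beta / (1.0 + tau_c)  # clearly under budget: relax the penalty
    return beta  # inside the tolerance band: leave beta unchanged


# Two training examples with one-dimensional representations.
mu, sigma = update_prior(nus=[[0.0], [2.0]], rhos=[[1.0], [1.0]])

# eps_c = 1.0 is a hypothetical tolerance chosen for this example.
beta = 1.0
beta = update_beta(beta, avg_kl=12.0, budget=10.0, eps_c=1.0)
```

The dead band of width $\epsilon_C$ below the budget is what keeps $\beta$ from oscillating once the average KL is close to $C$.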

Note that unlike other inr-based methods (Dupont et al., 2022; Schwarz & Teh, 2022; Schwarz et al., 2023) our training procedure is remarkably stable, as we illustrate in Section D.4.

4 Related Works

Nonlinear transform coding: Currently, the dominant paradigm in neural compression is nonlinear transform coding (NTC; Ballé et al., 2020), usually implemented using variational autoencoders (VAEs). NTC has achieved impressive performance in terms of both objective metrics (Cheng et al., 2020; He et al., 2022) and perceptual quality (Mentzer et al., 2020), mainly due to its expressive learned nonlinear transforms (Ballé et al., 2020; Zhu et al., 2021; Liu et al., 2023) and elaborate entropy models (Ballé et al., 2018; Minnen et al., 2018; Guo et al., 2021).

Compressing inrs can also be viewed as a form of NTC: we use gradient descent to transform data into an inr. The idea to quantize inr weights and entropy code them was first proposed by Dupont et al. (2021), whose method has since been extended significantly (Dupont et al., 2022; Schwarz & Teh, 2022; Schwarz et al., 2023). The current state-of-the-art inr-based method, VC-INR (Schwarz et al., 2023), achieves impressive results across several data modalities, albeit at the cost of significantly higher complexity and still falling short of autoencoder-based NTC methods on images. Our method, following combiner (Guo et al., 2023), differs from all of the above methods, as it uses REC to encode our variational inrs, instead of quantization and entropy coding.

Linear weight reparameterization: Similar to our proposal in Section 3.1, Oktay et al. (2019) learn an affine reparameterization of the weights of large neural networks. They demonstrate that scalar quantization in the transformed space leads to significant gains in compression performance. However, since they are performing one-shot model compression, their linear transformations have very few parameters as they need to transmit them alongside the quantized weights, limiting their expressivity. On the other hand, recombiner learns the linear transform during training after which it is fixed and shared between communicating parties, thus it does not cause any communication overhead. Therefore, our linear transformation can be significantly more expressive.

Positional encodings: Some recent works have demonstrated that learning positional features is beneficial for fitting inrs (Jiang et al., 2020; Kim et al., 2022; Müller et al., 2022; Ladune et al., 2023). Sharing a similar motivation, our method essentially incorporates implicit representations with explicit ones, forming a hybrid inr framework (Chen et al., 2023).

5 Experimental Results

(a) RD curve on CIFAR-10 (left) and Kodak (right). We also provide full-resolution plots in Appendix F.
(b) Decoded videos and residuals.
(c) Decoded protein structure examples.
Figure 3: Quantitative evaluation and qualitative examples of recombiner on image, audio, video, and 3D protein structure. Kbps stands for kilobits per second, RMSD stands for Root Mean Square Deviation, and bpa stands for bits per atom. For all plots, we use solid lines to denote inr-based codecs, dotted lines to denote VAE-based codecs, and dashed lines to denote classical codecs.

In this section, we evaluate recombiner on image, audio, video, and 3D protein structure data and demonstrate that it achieves strong performance across all modalities. We also perform extensive ablation studies on the CIFAR-10 and Kodak datasets which demonstrate recombiner’s robustness and the effectiveness of each of our proposed solutions. For all experiments, we use a 4-layer, 32-hidden unit SIREN network (Sitzmann et al., 2020) as the inr architecture unless otherwise stated, and a small 3-layer convolution network as the upsampling network $\phi$, as shown in Figure 6 in the appendix. See Appendix C for the detailed description of our experimental setup.
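For readers unfamiliar with SIREN, the following sketch shows a small sine-activated MLP of the stated size in plain Python. Several details are assumptions made for illustration only: "4-layer" is read as four linear layers, the input is a 2D coordinate without positional encodings, the output is an RGB triple, the frequency factor $\omega_0 = 30$ and the weight initialization follow Sitzmann et al. (2020), and biases are set to zero for simplicity. It is not the authors' network definition.

```python
import math
import random


def init_siren(in_dim, hidden, out_dim, n_layers, omega_0=30.0, seed=0):
    """Create (weights, biases) for each linear layer with SIREN-style initialization."""
    rng = random.Random(seed)
    dims = [in_dim] + [hidden] * (n_layers - 1) + [out_dim]
    layers = []
    for k, (fan_in, fan_out) in enumerate(zip(dims[:-1], dims[1:])):
        # First layer: U(-1/fan_in, 1/fan_in); later layers: U(-sqrt(6/fan_in)/omega_0, +...).
        bound = 1.0 / fan_in if k == 0 else math.sqrt(6.0 / fan_in) / omega_0
        W = [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]
        b = [0.0] * fan_out  # zero biases: a simplification of this sketch
        layers.append((W, b))
    return layers


def siren_forward(layers, x, omega_0=30.0):
    """Map one coordinate vector x to a signal value; sine on all but the last layer."""
    h = list(x)
    for k, (W, b) in enumerate(layers):
        pre = [sum(w * v for w, v in zip(row, h)) + bias for row, bias in zip(W, b)]
        is_last = k == len(layers) - 1
        h = pre if is_last else [math.sin(omega_0 * p) for p in pre]
    return h


def count_parameters(layers):
    """Total number of weights and biases."""
    return sum(len(W) * len(W[0]) + len(b) for W, b in layers)


layers = init_siren(in_dim=2, hidden=32, out_dim=3, n_layers=4)
rgb = siren_forward(layers, [0.5, -0.25])
n_params = count_parameters(layers)
```

Under these assumptions the network has only a few thousand parameters, which illustrates why the inr weights themselves are a compact object to encode.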

5.1 Data Compression across Modalities

Image: We evaluate recombiner on the CIFAR-10 (Krizhevsky et al., 2009) and Kodak (Kodak, 1993) image datasets. Figure 3(a) shows its rate-distortion (RD) performance alongside recent inr- and VAE-based methods, as well as VTM (JVET, 2020) (Footnote 2: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tree/VTM-12.0?ref_type=tags), BPG (Bellard, 2014) and JPEG2000. recombiner displays remarkable performance on CIFAR-10, especially at low bitrates, outperforming even VAE-based codecs. On Kodak, it outperforms most inr-based codecs and is competitive with the more complex VC-INR method of Schwarz et al. (2023). Finally, while recombiner still falls behind VAE-based codecs, it significantly reduces the performance gap.
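For reference, the two axes of an image RD curve can be computed as in the sketch below: bitrate in bits per pixel (bpp) and distortion as PSNR in dB. The sketch assumes 8-bit pixel values with peak value 255 and flattened pixel lists; the function names are illustrative.

```python
import math


def psnr(original, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def bits_per_pixel(total_bits, height, width):
    """Bitrate of a compressed image in bits per pixel."""
    return total_bits / (height * width)


# A 32x32 image (the CIFAR-10 resolution) stored in 1024 bits costs 1 bpp.
rate = bits_per_pixel(total_bits=1024, height=32, width=32)
distortion = psnr([10, 20, 30, 40], [12, 18, 30, 41])
```

Higher PSNR at the same bpp, or lower bpp at the same PSNR, corresponds to a curve lying further towards the upper left of the plot.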

Audio: Following the experimental set-up of Guo et al. (2023), we evaluate our method on the LibriSpeech (Panayotov et al., 2015) dataset. In Figure 3, we depict recombiner’s RD curve on the full test set, alongside the curves of VC-INR, COIN++, and MP3. recombiner outperforms both COIN++ and MP3 and matches VC-INR. Since Guo et al. (2023) only tested combiner on 24 test clips, we do not include combiner in this plot but provide an additional comparison in Figure 13 in Appendix F, where recombiner clearly outperforms combiner.

Video: We evaluate recombiner on the UCF-101 action recognition dataset (Soomro et al., 2012), following Schwarz et al. (2023)’s experimental setup. However, as they do not report their train-test split and due to the time-consuming encoding process of our approach, we only benchmark our method against H.264 and H.265 on 16 randomly selected video clips. Figure 3 shows that recombiner achieves comparable performance to the classic domain-specific codecs H.264 and H.265, especially at lower bitrates. However, there is still a gap between our approach and H.264 and H.265 when they are configured to prioritize quality. Figure 3(b) shows a non-cherry-picked video compressed with recombiner at two different bitrates and its reconstruction errors.

3D Protein Structure: To further illustrate the applicability of our approach, we use it to compress the 3D coordinates of C$\alpha$ atoms in protein fragments. We take domain-specific lossy codecs as baselines, including Foldcomp (Kim et al., 2023), PDC (Zhang & Pyle, 2023) and PIC (Staniscia & Yu, 2023). Surprisingly, as shown in Figure 3, recombiner’s performance is competitive with highly domain-specific codecs. Furthermore, it allows us to tune its rate-distortion performance, whereas the baselines only support a certain compression rate. Since the experimental resolution of 3D structures is typically between 1 and 3 Å (RCSB Protein Data Bank, 2000), recombiner could help with reducing the increasing storage demand for protein structures without losing key information. Figure 3(c) shows non-cherry-picked examples compressed with our method.

5.2 Effectiveness of Our Solutions, Ablation Studies and Runtime Analysis

This section showcases recombiner’s robustness to model size and the effectiveness of each component. Section D.1 provides additional visualizations for a deeper understanding of our methods.

(a) w/o positional encodings; bitrate 0.287 bpp; PSNR 25.62 dB.
(b) with positional encodings; bitrate 0.316 bpp; PSNR 26.85 dB.
(c) with positional encodings; bitrate 0.178 bpp; PSNR 25.05 dB.
Figure 4: Comparison between kodim24 details compressed with and without learnable positional encodings. (a) and (b) have similar bitrates, and (a) and (c) have similar PSNRs.
Figure 5: (a) RD performances of combiner and recombiner with different numbers of hidden units. (b), (c) Ablation studies on CIFAR-10 and Kodak. LR: linear reparameterization; PE: positional encodings; HM: hierarchical model; RP: random permutation across patches. We describe the details of experimental settings for ablation studies in Section C.3.

Positional encodings facilitate local deviations: Figure 4 compares images obtained by recombiner with and without positional encodings at matching bitrates and PSNRs. As we can see, positional encodings preserve intricate details in fine-textured regions while preventing noisy artifacts in other regions of the patches, making recombiner’s reconstructions more visually pleasing.

recombiner is more robust to model size: Using the same inr architecture, Figure 5(a) shows combiner and recombiner’s RD curves as we vary the number of hidden units. recombiner displays minimal performance variation and also consistently outperforms combiner. Based on Figure 7 in Appendix D, this phenomenon is likely due to recombiner’s linear weight reparameterization allowing it to more flexibly prune its weight representations.

Ablation study: In Figures 5(b) and 5(c), we ablate our linear reparameterization, positional encodings, hierarchical model, and permutation strategy on CIFAR-10 and Kodak, with five key takeaways:

  1. Linear weight reparameterization consistently improves performance on both datasets, yielding up to 4 dB gain on CIFAR-10 at high bitrates and over 0.5 dB gain on Kodak in PSNR.

  2. Learnable positional encodings provide more substantial advantages at lower bitrates. On CIFAR-10, the encodings contribute up to 0.5 dB gain when the bitrate falls below 2 bpp. On Kodak, the encodings provide noteworthy gains of 2 dB at low bitrates and 1 dB at high bitrates.

  3. Surprisingly, the hierarchical model without positional encodings can degrade performance. We hypothesize that this is because directly applying the hierarchical model poses challenges in optimizing Equation 1. A potential solution is to warm up the rate penalty $\beta$ level by level akin to what is done in hierarchical VAEs (Sønderby et al., 2016), which we leave for further work.

  4. However, positional encodings appear to consistently alleviate this optimization difficulty, yielding 0.5 dB gain when used with hierarchical models.

  5. Our proposed permutation strategy provides significant gains of 0.5 dB at low bitrates and more than 1.5 dB at higher bitrates.

Runtime Analysis: We list recombiner’s encoding and decoding times in Section D.5. Unfortunately, our approach exhibits a long encoding time, similar to combiner. However, our decoding process is still remarkably fast, matching the speed of COIN and combiner, even on CPUs.

6 Conclusions and Limitations

In this paper, we propose recombiner, a new codec based on several non-trivial extensions to combiner, encompassing the linear reparameterization for the network weights, learnable positional encodings, and expressive hierarchical Bayesian models for high-resolution signals. Experiments demonstrate that our proposed method sets a new state-of-the-art on low-resolution images at low bitrates, and consistently delivers strong results across other data modalities.

A major limitation of our work is the encoding time complexity, and tackling it should be of primary concern in future work. A possible avenue for solving this issue is to reduce the number of parameters to optimize over and switch from inference over weights to modulations using, e.g. FiLM layers (Perez et al., 2018), as is done in other inr-based works. A second limitation is that while compressing with patches enables parallelization and higher robustness, it is suboptimal as it leads to block artifacts, as can be seen in Figure 4. Third, as Guo et al. (2023) demonstrate, the approximate samples given by A* coding significantly impact the method’s performance, e.g. by requiring more fine-tuning. An interesting question is whether an exact REC algorithm could be adapted to solve this issue, such as the recently developed greedy Poisson rejection sampler (Flamich, 2023).

7 Acknowledgements

The authors would like to thank Runsen Feng for helping us ensure that the baseline for our experiments on video compression is correctly set up. GF acknowledges funding from DeepMind. ZG acknowledges funding from the Outstanding PhD Student Program at the University of Science and Technology of China.

References

  • Agustsson & Timofte (2017) Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
  • Ballé et al. (2020) Johannes Ballé, Philip A Chou, David Minnen, Saurabh Singh, Nick Johnston, Eirikur Agustsson, Sung Jin Hwang, and George Toderici. Nonlinear transform coding. IEEE Journal of Selected Topics in Signal Processing, 2020.
  • Ballé et al. (2018) Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In International Conference on Learning Representations, 2018.
  • Bellard (2014) Fabrice Bellard. BPG image format. https://bellard.org/bpg/, 2014. Accessed: 2023-09-27.
  • Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, 2015.
  • Chen et al. (2023) Hao Chen, Matthew Gwilliam, Ser-Nam Lim, and Abhinav Shrivastava. Hnerf: A hybrid neural representation for videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • Cheng et al. (2020) Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. Learned image compression with discretized gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020.
  • Dupont et al. (2021) Emilien Dupont, Adam Golinski, Milad Alizadeh, Yee Whye Teh, and Arnaud Doucet. Coin: Compression with implicit neural representations. In Neural Compression: From Information Theory to Applications–Workshop@ ICLR 2021, 2021.
  • Dupont et al. (2022) Emilien Dupont, Hrushikesh Loya, Milad Alizadeh, Adam Golinski, Y Whye Teh, and Arnaud Doucet. Coin++: Neural compression across modalities. Transactions on Machine Learning Research, 2022.
  • Dusenberry et al. (2020) Michael Dusenberry, Ghassen Jerfel, Yeming Wen, Yian Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, and Dustin Tran. Efficient and scalable bayesian neural nets with rank-1 factors. In International conference on machine learning, 2020.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, 2017.
  • Flamich (2023) Gergely Flamich. Greedy Poisson rejection sampling. In Advances in Neural Information Processing Systems, 2023.
  • Flamich et al. (2020) Gergely Flamich, Marton Havasi, and José Miguel Hernández-Lobato. Compressing images by encoding their latent representations with relative entropy coding. In Advances in Neural Information Processing Systems, 2020.
  • Flamich et al. (2022) Gergely Flamich, Stratis Markou, and José Miguel Hernández-Lobato. Fast relative entropy coding with A* coding. In International Conference on Machine Learning, 2022.
  • Guo et al. (2021) Zongyu Guo, Zhizheng Zhang, Runsen Feng, and Zhibo Chen. Causal contextual prediction for learned image compression. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • Guo et al. (2023) Zongyu Guo, Gergely Flamich, Jiajun He, Zhibo Chen, and José Miguel Hernández-Lobato. Compression with Bayesian implicit neural representations. In Advances in Neural Information Processing Systems, 2023.
  • Havasi et al. (2018) Marton Havasi, Robert Peharz, and José Miguel Hernández-Lobato. Minimal random code learning: Getting bits back from compressed model parameters. In International Conference on Learning Representations, 2018.
  • He et al. (2022) Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang. Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • Hu et al. (2021) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
  • Izmailov et al. (2021) Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Wilson. What are Bayesian neural network posteriors really like? In International conference on machine learning, 2021.
  • Jiang et al. (2020) Chiyu Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, Thomas Funkhouser, et al. Local implicit grid representations for 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • JVET (2020) JVET. VVC official test model. https://jvet.hhi.fraunhofer.de, 2020. Accessed: 2024-03-05.
  • Kim et al. (2023) Hyunbin Kim, Milot Mirdita, and Martin Steinegger. Foldcomp: a library and format for compressing and indexing large protein structure sets. Bioinformatics, 2023.
  • Kim et al. (2022) Subin Kim, Sihyun Yu, Jaeho Lee, and Jinwoo Shin. Scalable neural video representations with learnable positional features. In Advances in Neural Information Processing Systems, 2022.
  • Kingma et al. (2015) Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, 2015.
  • Kodak (1993) Eastman Kodak. Kodak Lossless True Color Image Suite (PhotoCD PCD0992). http://r0k.us/graphics/kodak/, 1993.
  • Koller & Friedman (2009) Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images, 2009.
  • Ladune et al. (2023) Théo Ladune, Pierrick Philippe, Félix Henry, Gordon Clare, and Thomas Leguay. Cool-chic: Coordinate-based low complexity hierarchical image codec. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
  • Liu et al. (2023) Jinming Liu, Heming Sun, and Jiro Katto. Learned image compression with mixed transformer-cnn architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • Mentzer et al. (2020) Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity generative image compression. In Advances in Neural Information Processing Systems, 2020.
  • Minnen et al. (2018) David Minnen, Johannes Ballé, and George D Toderici. Joint autoregressive and hierarchical priors for learned image compression. In Advances in neural information processing systems, 2018.
  • Müller et al. (2022) Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 2022.
  • Oktay et al. (2019) Deniz Oktay, Johannes Ballé, Saurabh Singh, and Abhinav Shrivastava. Scalable model compression by entropy penalized reparameterization. In International Conference on Learning Representations, 2019.
  • Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
  • Perez et al. (2018) Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In AAAI conference on artificial intelligence, 2018.
  • RCSB Protein Data Bank (2000) RCSB Protein Data Bank. PDB Statistics: PDB data distribution by resolution. https://www.rcsb.org/stats/distribution-resolution, 2000. Accessed: 2023-09-27.
  • Schwarz & Teh (2022) Jonathan Richard Schwarz and Yee Whye Teh. Meta-learning sparse compression networks. Transactions on Machine Learning Research, 2022.
  • Schwarz et al. (2023) Jonathan Richard Schwarz, Jihoon Tack, Yee Whye Teh, Jaeho Lee, and Jinwoo Shin. Modality-agnostic variational compression of implicit neural representations. In International conference on machine learning, 2023.
  • Sitzmann et al. (2020) Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In Advances in Neural Information Processing Systems, 2020.
  • Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, 2016.
  • Soomro et al. (2012) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • Staniscia & Yu (2023) Luke Staniscia and Yun William Yu. Image-centric compression of protein structures improves space savings. BMC Bioinformatics, 2023.
  • Stanley (2007) Kenneth O Stanley. Compositional pattern producing networks: A novel abstraction of development. Genetic programming and evolvable machines, 2007.
  • Tancik et al. (2020) Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems, 2020.
  • Tomar (2006) Suramya Tomar. Converting video formats with FFmpeg. Linux Journal, 2006.
  • Trippe & Turner (2017) Brian Trippe and Richard Turner. Overpruning in variational Bayesian neural networks. In Advances in Approximate Bayesian Inference workshop at NIPS 2017, 2017.
  • Zhang & Pyle (2023) Chengxin Zhang and Anna Marie Pyle. PDC: a highly compact file format to store protein 3D coordinates. Database (Oxford), 2023.
  • Zhu et al. (2021) Yinhao Zhu, Yang Yang, and Taco Cohen. Transformer-based transform coding. In International Conference on Learning Representations, 2021.

Appendix A Notations

We summarize the notations used in this paper in Table 1:

Notation | Name
$\beta$ | rate penalty hyperparameter in Equation 1
$C$ | coding budget
$\tau_C$ | step size for adjusting $\beta$
$\epsilon_C$ | threshold parameter to stabilize training when adjusting $\beta$
$\mathbf{w}$ | weights in inr
$\mathbf{x}_i$ | $i$th coordinate
$\mathbf{y}_i$ | $i$th signal value
$\mathbf{z}_i$ | recombiner's upsampled positional encodings at coordinate $\mathbf{x}_i$
$\mathbf{h}_{\mathbf{w}}$ | recombiner's latent inr weights
$\mathbf{h}_{\mathbf{z}}$ | recombiner's latent positional encodings
$\mathbf{h}_{\mathbf{w}}^{(\pi)}$ | latent inr weights for $\pi$th patch (lowest level of the hierarchical model)
$\mathbf{h}_{\mathbf{z}}^{(\pi)}$ | latent positional encodings for $\pi$th patch (lowest level of the hierarchical model)
$\overline{\mathbf{h}}_{\mathbf{w}}^{(g)}$ | $g$th representation in the second level of the hierarchical model
$\overline{\overline{\mathbf{h}}}_{\mathbf{w}}$ | third level representations of the hierarchical model
$\bm{\nu}$ | mean of the Gaussian posterior
$\bm{\mu}$ | mean of the Gaussian prior
$\bm{\rho}$ | diagonal of the covariance matrix of the Gaussian posterior
$\bm{\sigma}$ | diagonal of the covariance matrix of the Gaussian prior
$\bm{A}$ | recombiner's linear transform on inr weights
$\bm{H}^{(\ell)}$ | matrix stacking representations in the $\ell$th level defined in Equation 5
$\widetilde{\bm{H}}^{(\ell)}$ | matrix for representations in the $\ell$th level after permutation
$\mathcal{D}$ | a signal data point (as a dataset with coordinate-value pairs)
$S_n$ | set of all permutations on $n$ elements
$\gamma(\cdot)$ | Fourier embedding to coordinates
$\alpha(\cdot), \kappa(\cdot)$ | a permutation
$\phi(\cdot)$ | upsampling network for positional encodings
$g(\cdot \mid \mathbf{w})$ | inr with weights $\mathbf{w}$
Table 1: Notations.

Appendix B recombiner’s Training Algorithms

We describe the algorithm to train recombiner in Algorithm 1.

Algorithm 1 Training recombiner: the prior, the linear transform $\bm{A}$ and upsampling network $\phi$
Input: Training data $\{\mathcal{D}_1, \dots, \mathcal{D}_M\}$; desired bitrate $C$.
Initialize: $q_{\mathbf{h},m} = \mathcal{N}(\bm{\nu}_m, \mathrm{diag}(\bm{\rho}_m))$ for every training instance $\mathcal{D}_m$.
Initialize: $p_{\mathbf{h}} = \mathcal{N}(\bm{\mu}, \mathrm{diag}(\bm{\sigma}))$.
Initialize: $\bm{A}$, $\phi$.
repeat until convergence
      # Step 1: Optimize posteriors, linear reparameterization matrix, and upsampling network
     $\{\bm{\nu}_m, \bm{\rho}_m\}_{m=1}^{M}, \bm{A}, \phi \leftarrow \operatorname*{arg\,min}_{\{\bm{\nu}_m, \bm{\rho}_m\}_{m=1}^{M}, \bm{A}, \phi} \left\{ \frac{1}{M} \sum_{m=1}^{M} \mathcal{L}(\mathcal{D}_m, q_{\mathbf{h},m}, p_{\mathbf{h}}, \bm{A}, \phi, \beta) \right\}$. ▷ Optimize by Equation 6
      # Step 2: Update prior
     $\bm{\mu} \leftarrow \frac{1}{M} \sum_{m=1}^{M} \bm{\nu}_m, \quad \bm{\sigma} \leftarrow \frac{1}{M} \sum_{m=1}^{M} \left[ (\bm{\nu}_m - \bm{\mu})^2 + \bm{\rho}_m \right]$. ▷ Update by Equation 7
      # Step 3: Update $\beta$
     $\bar{\delta} = \frac{1}{M} \sum_{m=1}^{M} D_{\mathrm{KL}}[q_{\mathbf{h},m} \| p_{\mathbf{h}}]$. ▷ Calculate the average training KL
     if $\bar{\delta} > C$ then
         $\beta \leftarrow \beta \times (1 + \tau_C)$ ▷ Increase $\beta$ if budget is exceeded
     end if
     if $\bar{\delta} < C - \epsilon_C$ then
         $\beta \leftarrow \beta / (1 + \tau_C)$ ▷ Decrease $\beta$ if budget is not fully occupied
     end if
end repeat
Return: $p_{\mathbf{h}} = \mathcal{N}(\bm{\mu}, \mathrm{diag}(\bm{\sigma}))$, $\bm{A}$, $\phi$.
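
Steps 2 and 3 of Algorithm 1 are closed-form updates and can be sketched in a few lines of plain Python. The sketch below is illustrative only: the function names `update_prior` and `update_beta` and the list-of-lists data layout are assumptions made for this example, not the interface of the released PyTorch implementation, and the gradient-based Step 1 is omitted.

```python
def update_prior(nus, rhos):
    """Step 2: moment-match the prior to the M factorized posteriors.

    nus[m][d] is the posterior mean and rhos[m][d] the posterior variance of
    dimension d for training instance m (hypothetical data layout).
    """
    num_instances = len(nus)
    dim = len(nus[0])
    mu = [sum(nu[d] for nu in nus) / num_instances for d in range(dim)]
    sigma = [
        sum((nus[m][d] - mu[d]) ** 2 + rhos[m][d] for m in range(num_instances))
        / num_instances
        for d in range(dim)
    ]
    return mu, sigma


def update_beta(beta, avg_kl, budget, tau, eps):
    """Step 3: multiplicative adjustment of the rate penalty beta."""
    if avg_kl > budget:
        return beta * (1.0 + tau)  # budget exceeded: penalize rate more
    if avg_kl < budget - eps:
        return beta / (1.0 + tau)  # budget not fully used: relax the penalty
    return beta  # inside the tolerance band [budget - eps, budget]


# Toy usage with two one-dimensional posteriors.
mu, sigma = update_prior(nus=[[0.0], [2.0]], rhos=[[1.0], [1.0]])
beta = update_beta(beta=1e-8, avg_kl=10.0, budget=8.0, tau=0.5, eps=1.0)
```

The band between $C - \epsilon_C$ and $C$ leaves $\beta$ unchanged, which is what the threshold $\epsilon_C$ in Table 1 is described as stabilizing.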

Appendix C Supplementary Experimental Details

C.1 Datasets and More Details on Experiments

In this section, we describe the datasets and our experimental settings. We depict the upsampling network we used in Figure 6 and summarize the hyperparameters for each modality in Table 2. We present details of the baselines in Section C.2.

Note that, since the proposed linear reparameterization yields a full-covariance Gaussian posterior over the weights in the inr, the local reparameterization trick (Kingma et al., 2015) is not applicable in recombiner. Therefore, in the above experiments, when inferring the posteriors of a test signal, we employ a Monte Carlo estimator with 5 samples to estimate the expectation in the $\beta$-ELBO in Equation 1. During the training stage, we still use 1 sample. In Section D.3, we provide an analysis of the sample size's influence. It is worth noting that using just 1 sample during inference does not significantly deteriorate performance, so we have the flexibility to reduce the sample size when prioritizing encoding time, with marginal performance impact.
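
The role of the sample size can be illustrated with a minimal Monte Carlo sketch. It assumes that the $\beta$-ELBO has the usual form "expected distortion plus $\beta$ times the KL divergence" and that the posterior over the latent $\mathbf{h}$ is a diagonal Gaussian with mean $\bm{\nu}$ and variances $\bm{\rho}$; the function names and the toy distortion are hypothetical and only serve the example.

```python
import math
import random


def kl_diag_gaussians(nu, rho, mu, sigma):
    """KL[N(nu, diag(rho)) || N(mu, diag(sigma))] in nats; rho, sigma are variances."""
    return sum(
        0.5 * (math.log(s / r) + (r + (n - m) ** 2) / s - 1.0)
        for n, r, m, s in zip(nu, rho, mu, sigma)
    )


def beta_elbo_estimate(distortion, nu, rho, mu, sigma, beta, num_samples, rng):
    """Average the distortion over num_samples posterior samples, then add beta * KL."""
    total = 0.0
    for _ in range(num_samples):
        # Reparameterized sample h = nu + sqrt(rho) * eps with eps ~ N(0, I).
        h = [n + math.sqrt(r) * rng.gauss(0.0, 1.0) for n, r in zip(nu, rho)]
        total += distortion(h)
    return total / num_samples + beta * kl_diag_gaussians(nu, rho, mu, sigma)


# Toy usage: the "signal" is a single target value and the latent is decoded as-is.
def toy_distortion(h):
    return (h[0] - 0.3) ** 2


rng = random.Random(0)
estimate_5 = beta_elbo_estimate(toy_distortion, [0.2], [0.01], [0.0], [1.0], 1e-3, 5, rng)
estimate_1 = beta_elbo_estimate(toy_distortion, [0.2], [0.01], [0.0], [1.0], 1e-3, 1, rng)
```

Using more samples only reduces the variance of the distortion term; the KL term is available in closed form and is unaffected by the sample size.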

CIFAR-10: CIFAR-10 is a set of low-resolution images of size $32 \times 32$. It has a training set of 50,000 images and a test set of 10,000 images. We randomly select 15,000 images from the training set for the training stage and evaluate RD performance on all test images. We use a SIREN network (Sitzmann et al., 2020) with 4 layers and 32 hidden units as the inr architecture.

Kodak: The Kodak dataset is a commonly used image compression benchmark, containing 24 images with resolutions of either $768 \times 512$ or $512 \times 768$. In our experiments, we split each image into 96 patches of size $64 \times 64$ and compress the image patch by patch. Lacking a standard training set, we randomly select and crop 83 images of the same size (yielding 7,968 patches) from the DIV2K dataset (Agustsson & Timofte, 2017) as the training set. For each patch, we use the same inr setup as for CIFAR-10, i.e., a SIREN network (Sitzmann et al., 2020) with 4 layers and 32 hidden units. In addition, we apply a three-level hierarchical Bayesian model to the Kodak patches. The lowest level has 96 patches. Every 16 ($4 \times 4$) patches are grouped together in the second level, giving 6 groups in total. The highest level consists of a global representation for the entire image.
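
The patch and group counts quoted for Kodak follow from integer division of the image grid. The short check below is a sketch with a hypothetical helper `count_tiles`; it only reproduces the arithmetic stated in the paragraph above.

```python
def count_tiles(shape, tile_shape):
    """Number of non-overlapping tiles of tile_shape that exactly cover shape."""
    count = 1
    for size, tile in zip(shape, tile_shape):
        if size % tile != 0:
            raise ValueError(f"{size} is not divisible by {tile}")
        count *= size // tile
    return count


patches_per_image = count_tiles((768, 512), (64, 64))  # 12 x 8 grid of patches
groups_per_image = count_tiles((12, 8), (4, 4))        # 4 x 4 patches per group
training_patches = 83 * patches_per_image              # 83 cropped DIV2K images
```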

Audio: LibriSpeech (Panayotov et al., 2015) is a speech dataset recorded at a 16 kHz sampling rate. We follow the experimental settings of Guo et al. (2023), taking the first 3 seconds of every recording, corresponding to 48,000 audio samples. We compress each audio clip with 60 patches, each of which has 800 audio samples. For each patch, we use the same inr architecture as for CIFAR-10, except that the output of the network has only one dimension. We train recombiner on 197 training instances (corresponding to 11,820 patches) and evaluate it on the test set split by Guo et al. (2023), consisting of 24 instances. We also apply a three-level hierarchical model. The lowest level consists of 60 patches. Every 4 patches are grouped together in the second level, giving $60 / 4 = 15$ groups in total. The highest level consists of a global representation for the entire signal.

Video: UCF-101 (Soomro et al., 2012) is a dataset of human actions. It consists of 101 action classes, over 13k clips, and 27 hours of video data. We follow Schwarz et al. (2023), center-cropping each video clip to $240 \times 240 \times 24$ and then resizing it to $128 \times 128 \times 24$. We then compress each clip in patches of size $16 \times 16 \times 24$. We train recombiner on 75 video clips (4,800 patches) and evaluate it on 16 randomly selected clips. For each patch, we again use an inr with 4 layers and 32 hidden units. We also apply the three-level hierarchical model. The lowest level consists of 64 patches. Every 16 ($4 \times 4$) patches are grouped together in the second level, giving 4 groups in total. The highest level consists of a global representation for the entire clip.

3D Protein structure: We evaluate recombiner on the Saccharomyces cerevisiae proteome from the AlphaFold DB v4 (https://ftp.ebi.ac.uk/pub/databases/alphafold/v4/UP000002311_559292_YEAST_v4.tar). To standardize the dataset, for each protein, we take the C$\alpha$ atoms of the first 96 residues (i.e., amino acids) as the target data to be compressed. The input coordinates are the indices of the C$\alpha$ atoms (ranging from 1 to 96, normalized to lie between 0 and 1), and the outputs of the inrs are their corresponding 3D coordinates. We randomly select 1,000 structures as the test set and use the others as the training set. We again use the same inr architecture as for CIFAR-10, i.e., a SIREN network with 4 layers and 32 hidden units in each layer. We use the standard MSE as the distortion measure. Note that our method can also be extended to account for the rotation and translation invariance of the 3D structure by using different losses.
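
To make the protein setup concrete, the sketch below builds the normalized input coordinates and evaluates the MSE distortion. The linear rescaling $(i - 1) / 95$ is an assumption, since the text only states that the indices are normalized between 0 and 1, and both helper names are hypothetical.

```python
def normalized_indices(num_residues=96):
    """Map residue indices 1..num_residues linearly onto [0, 1] (assumed rescaling)."""
    return [(i - 1) / (num_residues - 1) for i in range(1, num_residues + 1)]


def mse(predicted, target):
    """Mean squared error over all atoms and all three spatial coordinates."""
    squared_errors = [
        (p - t) ** 2
        for pred_atom, target_atom in zip(predicted, target)
        for p, t in zip(pred_atom, target_atom)
    ]
    return sum(squared_errors) / len(squared_errors)


inputs = normalized_indices()
# Toy example with a single atom; a real structure would have 96 (x, y, z) triples.
error = mse([(0.0, 0.0, 0.0)], [(1.0, 2.0, 2.0)])
```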

Hyperparameter | Image: CIFAR-10 | Image: Kodak | Audio | Video | Protein
Patching
patch or not | no | yes | yes | yes | no
patch size | \ | $64 \times 64$ | 800 | $16 \times 16 \times 24$ | \
hierarchical model levels | \ | 3 | 3 | 3 | \
number of patches (lowest level) | \ | 96 | 60 | 64 | \
number of groups of patches (middle level) | \ | 6 | 16 | 4 | \
number of groups of groups (highest level) | \ | 1 | 1 | 1 | \
Positional Encodings
latent positional encoding shape | $128 \cdot 2 \cdot 2$ | $128 \cdot 4 \cdot 4$ | $128 \cdot 50$ | $128 \cdot 1 \cdot 1 \cdot 1$ | $128 \cdot 6$
latent positional encoding param number | 512 | 2560 | 6400 | 128 | 768
upsampled positional encoding shape | $16 \times 32 \times 32$ | $16 \times 64 \times 64$ | $16 \times 800$ | $16 \times 16 \times 16 \times 24$ | $16 \times 96$
INR Architecture
layers | 4 (all modalities)
hidden units | 32 (all modalities)
Fourier embeddings dimension | 16 | 16 | 16 | 18 ($\frac{16}{3}$ is not an integer) | 16
output dimension | 3 | 3 | 1 | 3 | 1
number of parameters | 3267 | 3267 | 3201 | 3331 | 3201
Training Stage
training size | 15000 | 83 (7968 patches) | 197 (11820 patches) | 75 (4800 patches) | 4691
epochs | 550 (all modalities)
optimizer | Adam (lr=0.0002) (all modalities)
sample size to estimate $\beta$-ELBO | 1 (all modalities)
gradient iteration between updating prior | 100 (all modalities)
the first gradient iteration | 200 (all modalities)
initial posterior variance | $9 \times 10^{-6}$ (all modalities)
initial posterior mean | SIREN initialization (all modalities)
initial $\bm{A}^{[l]}$ values | $A \sim \mathcal{U}(-1/a, 1/a)$, $a = d_{in} d_{out}$, where $d_{in}$ and $d_{out}$ are the input and output dimensions of layer $l$ (all modalities)
$\epsilon_C$ | 0.3 bpp | 0.05 bpp | 0.5 kbps | 0.3 bpp | 0.3 bpa
$\beta$ | Adaptively adjusted. Initial value $1 \times 10^{-8}$ (all modalities)
Posterior Inferring and Compression Stage
gradient descent iteration | 30000 (all modalities)
optimizer | Adam (lr=0.0002) (all modalities)
sample size to estimate $\beta$-ELBO | 5 (all modalities)
blocks per signal (total number of blocks) | {19, 46, 60, 98, 123, 214, 281} | {1819, 3187, 4373, 7770, 12004, 23898} | {1066, 1999, 4146, 8182} | {2827, 5992, 14858, 29073} | {67, 211, 364, 503, 637}
bits per block | 16 bits (all modalities)
blocks in the lowest level (patch) | \ | {17, 30, 41, 73, 114, 233} | {15, 31, 64, 122} | {34, 71, 198, 409} | \
blocks in the middle level | \ | {17, 34, 52, 102, 145, 211} | {5, 5, 14, 50} | {109, 284, 427, 561} | \
blocks in the highest level | \ | {85, 103, 125, 150, 190, 264} | {31, 64, 96, 112} | {215, 312, 478, 653} | \
Table 2: Hyperparameters for images, audio, video, and protein structure compression.
Figure 6: Architecture of the upsampling network $\phi$ for learnable positional encodings. The numbers in each convolution layer represent the number of input channels, the number of output channels, and the kernel size, respectively. The "same" padding mode is used in all convolution layers. The kernel dimension depends on the modality: for instance, we use kernels of sizes 5, 3, 3 for audio and proteins, kernels of sizes $5 \times 5$, $3 \times 3$, $3 \times 3$ for images, and kernels of sizes $5 \times 5 \times 5$, $3 \times 3 \times 3$, $3 \times 3 \times 3$ for video.
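
Several entries of Table 2 can be cross-checked with simple arithmetic. The sketch below rests on two assumptions: the inr input is the concatenation of the Fourier embedding and the 16-channel upsampled positional encoding, with weights and biases both counted; and the total number of blocks per signal is the sum of the blocks over all patches, all groups, and the top-level representation. The helper names are hypothetical, and the block check is shown for the Kodak and video columns only.

```python
def mlp_param_count(in_dim, hidden, out_dim, num_layers):
    """Weights plus biases of a fully connected network with num_layers linear layers."""
    dims = [in_dim] + [hidden] * (num_layers - 1) + [out_dim]
    return sum(i * o + o for i, o in zip(dims[:-1], dims[1:]))


POS_ENC_CHANNELS = 16  # channels of the upsampled positional encoding (Table 2)

# (Fourier embedding dimension, output dimension) as listed in Table 2.
MODALITIES = {
    "CIFAR-10": (16, 3),
    "Kodak": (16, 3),
    "Audio": (16, 1),
    "Video": (18, 3),
    "Protein": (16, 1),
}
param_counts = {
    name: mlp_param_count(fourier + POS_ENC_CHANNELS, 32, out, 4)
    for name, (fourier, out) in MODALITIES.items()
}


def total_blocks(num_patches, num_groups, per_patch, per_group, top_level):
    """Blocks per signal summed over the three levels of the hierarchy."""
    return num_patches * per_patch + num_groups * per_group + top_level


kodak_totals = [
    total_blocks(96, 6, p, g, t)
    for p, g, t in zip(
        [17, 30, 41, 73, 114, 233],
        [17, 34, 52, 102, 145, 211],
        [85, 103, 125, 150, 190, 264],
    )
]
video_totals = [
    total_blocks(64, 4, p, g, t)
    for p, g, t in zip([34, 71, 198, 409], [109, 284, 427, 561], [215, 312, 478, 653])
]

# Approximate rate of the lowest Kodak setting, assuming 16 bits are spent per block.
kodak_lowest_bpp = kodak_totals[0] * 16 / (768 * 512)
```

Under these assumptions, the computed parameter counts and block totals agree with the corresponding rows of Table 2, and the lowest Kodak setting corresponds to roughly 0.074 bits per pixel.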

C.2 Baseline Settings

The baseline performances, including JPEG2000, BPG, COIN, COIN++, Ballé et al. (2018) and Cheng et al. (2020) on CIFAR-10 and Kodak, and MP3 and COIN++ on the full test set of LibriSpeech, are taken from the COIN++ GitHub repository (https://github.com/EmilienDupont/coinpp). Statistics for VC-INR and MSCN are those reported by their authors in the respective papers. We also include a comparison of recombiner and combiner on 24 test audio clips, since the authors of combiner did not test on the full test set. For this comparison, the performances of combiner and MP3 on the 24 test audio clips are provided by the authors of combiner.

Below, we describe the baselines for video and protein structure compression in detail.

C.2.1 Video Baselines

Video compression baselines are implemented with ffmpeg (Tomar, 2006), using the following commands.

H.264 (best speed):

ffmpeg.exe -i INPUT.avi -c:v libx264 -preset ultrafast -crf $CRF OUTPUT.mkv

H.265 (best speed):

ffmpeg.exe -i INPUT.avi -c:v libx265 -preset ultrafast -crf $CRF OUTPUT.mkv

H.264 (best quality):

ffmpeg.exe -i INPUT.avi -c:v libx264 -preset veryslow -crf $CRF OUTPUT.mkv

H.265 (best quality):

ffmpeg.exe -i INPUT.avi -c:v libx265 -preset veryslow -crf $CRF OUTPUT.mkv

The argument $CRF takes values in {15, 20, 25, 30, 35, 40}.
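
The four commands and six CRF values above define a sweep of 24 encodes per input clip. A minimal sketch of how such a sweep could be enumerated is given below; it only builds the argument lists and uses exactly the flags shown above. The output file naming scheme and the function name `build_commands` are assumptions for illustration.

```python
from itertools import product

ENCODERS = {"H.264": "libx264", "H.265": "libx265"}
PRESETS = {"best speed": "ultrafast", "best quality": "veryslow"}
CRF_VALUES = [15, 20, 25, 30, 35, 40]


def build_commands(input_path="INPUT.avi"):
    """Return one ffmpeg argument list per (encoder, preset, CRF) combination."""
    commands = []
    for encoder, preset, crf in product(ENCODERS.values(), PRESETS.values(), CRF_VALUES):
        # Hypothetical naming scheme so that the outputs do not overwrite each other.
        output_path = f"OUTPUT_{encoder}_{preset}_crf{crf}.mkv"
        commands.append([
            "ffmpeg.exe", "-i", input_path,
            "-c:v", encoder,
            "-preset", preset,
            "-crf", str(crf),
            output_path,
        ])
    return commands


commands = build_commands()
```

Each argument list can be passed to `subprocess.run` on a machine where ffmpeg is installed.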

C.2.2 Protein Baselines

PIC first employs a lossy mapping, converting the 3D coordinates of atoms to an image, and then losslessly compresses the image in PNG format. We use the PNG image size to calculate the bitrate.

As for PDC and Foldcomp, since they directly operate on PDB files containing other information like the headers, sequences, B factor, etc., we cannot use the file size directly. Therefore, we use their theoretical bitrates as our baseline. Below we present how we calculate their theoretical bitrates.

PDC uses three 4-byte integers to save the coordinates of the first C$\alpha$ atom, and three 1-byte integers for the coordinate differences of each remaining C$\alpha$ atom. Therefore, in theory, for a protein of 96 residues, each C$\alpha$ atom is assigned $(8 \times 3 \times 95 + 4 \times 8 \times 3 \times 1) / 96$ bits.

Foldcomp compresses the quantized dihedral/bond angles for each residue. Every residue needs 59 bits. In addition, Foldcomp saves uncompressed coordinates for every 25 residues as anchors, which requires 36 bytes. Therefore, the theoretical number of bits assigned to each C$\alpha$ is given by $(36 \times 8 + 59 \times 25) / 25$. However, since Foldcomp is designed to encode all backbone atoms (C, N, C$\alpha$) instead of merely C$\alpha$, it is unfair to compare in this way. We thus also report its performance on all backbone atoms for reference.
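
The two theoretical bitrates can be evaluated directly from the formulas above. The following sketch does so; the function and parameter names are hypothetical, and the byte and bit counts are the ones stated in the text.

```python
def pdc_bits_per_atom(num_residues=96):
    """Three 4-byte integers for the first atom, three 1-byte integers for each other atom."""
    first_atom_bits = 4 * 8 * 3 * 1
    remaining_bits = 8 * 3 * (num_residues - 1)
    return (remaining_bits + first_atom_bits) / num_residues


def foldcomp_bits_per_residue(bits_per_residue=59, anchor_bytes=36, anchor_interval=25):
    """Per-residue angle bits plus one uncompressed anchor every anchor_interval residues."""
    return (anchor_bytes * 8 + bits_per_residue * anchor_interval) / anchor_interval


pdc_rate = pdc_bits_per_atom()               # (2280 + 96) / 96
foldcomp_rate = foldcomp_bits_per_residue()  # (288 + 1475) / 25
```

These evaluate to 24.75 bits per C$\alpha$ atom for PDC and 70.52 bits per residue for Foldcomp.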

C.3 Ablation Study Settings

In this section, we describe the detailed settings for the ablation studies on CIFAR-10 and Kodak.

Experiments without Linear Reparameterization: We simply set $\mathbf{w}=\mathbf{h}_{\mathbf{w}}$ without the linear matrix $\bm{A}$. Since $\mathbf{w}$ then follows a mean-field Gaussian, we use the local reparameterization trick with 1 sample to reduce the variance during both training and inference.
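For readers unfamiliar with the local reparameterization trick, the following NumPy sketch shows the idea for a single linear layer with mean-field Gaussian weights: rather than sampling the weights, one samples the pre-activations, whose mean and variance are available in closed form. This is a generic illustration with hypothetical names and shapes, not the code used in the experiments.

```python
import numpy as np


def local_reparam_linear(x, w_mean, w_std, b_mean, b_std, rng):
    """Draw one sample of the pre-activations of a Bayesian linear layer.

    x:      (batch, fan_in) inputs
    w_mean: (fan_in, fan_out) posterior means of the weights
    w_std:  (fan_in, fan_out) posterior standard deviations of the weights
    b_mean, b_std: (fan_out,) posterior means / standard deviations of biases
    """
    # With independent Gaussian weights, each pre-activation is Gaussian.
    act_mean = x @ w_mean + b_mean
    act_var = (x ** 2) @ (w_std ** 2) + b_std ** 2
    eps = rng.standard_normal(act_mean.shape)
    return act_mean + np.sqrt(act_var) * eps


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
w_mean = rng.standard_normal((3, 4))
w_std = np.full((3, 4), 0.1)
b_mean = np.zeros(4)
b_std = np.full(4, 0.1)
sample = local_reparam_linear(x, w_mean, w_std, b_mean, b_std, rng)
```

Because the noise is drawn independently for every input, the gradient estimate has lower variance than sampling one weight matrix shared across the batch.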

Experiments without Positional Encodings: Recall that the input to the inrs in recombiner is the concatenation of the Fourier transformed coordinates $\gamma(\mathbf{x}_i)$ and the upsampled positional encodings at the corresponding position, $\mathbf{z}_i=\phi(\mathbf{h}_{\mathbf{z}})_{\mathbf{x}_i}$. In the experiments without positional encodings, we only input the Fourier transformed coordinates to the inr. To keep the inr size consistent, we also increase the dimension of the Fourier transformation, so that $\text{dim}(\gamma'(\mathbf{x}_i))\leftarrow\text{dim}(\gamma(\mathbf{x}_i))+\text{dim}(\mathbf{z}_i)$. Also, we no longer need to train the upsampling network $\phi$.
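The dimension matching can be illustrated as follows. This is a sketch under stated assumptions: the particular Fourier embedding (sines and cosines at power-of-two frequencies), the number of frequencies, and the encoding width `z_dim` are all hypothetical choices made here, and random values stand in for the upsampled positional encodings.

```python
import numpy as np


def fourier_features(x, num_freqs):
    """Map coordinates (N, d) to features of shape (N, 2 * d * num_freqs)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi      # (num_freqs,)
    angles = x[:, :, None] * freqs[None, None, :]      # (N, d, num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(x.shape[0], -1)


rng = np.random.default_rng(0)
N, d = 10, 2                  # 10 pixels with 2D coordinates
num_freqs, z_dim = 16, 64     # hypothetical sizes
x = rng.uniform(-1.0, 1.0, (N, d))
z = rng.standard_normal((N, z_dim))   # stand-in for the positional encodings

# Full model: Fourier features concatenated with positional encodings.
full_input = np.concatenate([fourier_features(x, num_freqs), z], axis=1)

# Ablation: drop z and widen the Fourier embedding by the same amount.
extra_freqs = z_dim // (2 * d)
ablation_input = fourier_features(x, num_freqs + extra_freqs)
```

With this embedding the match is exact only when `z_dim` is a multiple of `2 * d`.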

Experiments without Hierarchical Model: We assume all patch-inrs are independent and simply assign independent mean-field Gaussian priors and posteriors over $\mathbf{h}_{\mathbf{w}}^{(\pi)}$ for each patch.

Experiments without Random Permutation across patches: Recall that in recombiner, for each level in the hierarchical model, we stack the representations together into a matrix, where each row is one representation. We then (a) apply the same permutation to all rows. This is the same as in combiner and ensures that the KL is distributed uniformly across the entire representation for each patch. Then, (b) for each column, we apply its own permutation to encourage the KL to be distributed uniformly across patches. In the ablation study, we do not apply the permutation in (b) but still perform the permutation in (a).
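The two permutation steps can be sketched in NumPy as below. This is an illustration rather than the released implementation; in particular, deriving both permutations from a shared `seed` so that the decoder can invert them is an assumption of this sketch.

```python
import numpy as np


def permute(H, seed, across_patches=True):
    """Permute a (num_patches, dim) matrix of stacked representations."""
    rng = np.random.default_rng(seed)
    num_patches, dim = H.shape
    # (a) the same permutation of entries is applied within every row.
    perm_a = rng.permutation(dim)
    out = H[:, perm_a]
    perm_b = None
    if across_patches:
        # (b) every column gets its own permutation across the rows.
        perm_b = np.argsort(rng.random((num_patches, dim)), axis=0)
        out = np.take_along_axis(out, perm_b, axis=0)
    return out, perm_a, perm_b


def unpermute(P, perm_a, perm_b):
    """Invert `permute`, undoing step (b) first and then step (a)."""
    out = P
    if perm_b is not None:
        out = np.take_along_axis(out, np.argsort(perm_b, axis=0), axis=0)
    return out[:, np.argsort(perm_a)]


H = np.arange(12).reshape(3, 4)               # 3 patches, 4 dimensions each
P_full, a_full, b_full = permute(H, seed=0)   # recombiner setting
P_abl, a_abl, b_abl = permute(H, seed=0, across_patches=False)  # ablation
```

In the ablation setting each row of `P_abl` still contains only entries from a single patch, whereas in `P_full` a row (and hence a block taken from it) mixes entries from different patches.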

Appendix D Supplementary Experiments and Results

D.1 Methods Visualization

(a) Visualization of 4 channels in the upsampled positional encodings for kodim03 at 0.488 bpp. Patches are stitched together for a clearer visualization.
(b) Visualization of the information contained in encoded $\mathbf{h}_{\mathbf{w}}$ for kodim03 at 0.488 bpp. Patches are stitched together.
(c) Visualization of $\bm{A}^{[2]}$ at 0.074 bpp.
(d) Visualization of $\bm{A}^{[2]}$ at 0.972 bpp.
Figure 7: Visualizations.

In this section, we provide insight into our method through visualizations. Recall that in recombiner each signal is represented by $\mathbf{h}_{\mathbf{Z}}$ and $\mathbf{h}_{\mathbf{w}}$ together. We visualize the information contained in each of them. In addition, we visualize the linear transform $\bm{A}$ to understand how it improves performance.

Positional encodings: We take kodim03 at 0.488 bpp as an example, and visualize 4 channels of its upsampled positional encodings $\phi(\mathbf{h}_{\mathbf{z}})$ in Fig 6(a). Interestingly, before being fed into the inr, the positional encodings already exhibit the pattern of the image. This indicates how the learnable positional encodings help with the fitting. When the target signal is intricate and there is a strict bitrate constraint, the INR capacity is insufficient for learning the complex mapping from coordinates to signal values directly. On the other hand, when combined with positional encodings, the INR simply needs to extract, combine, and enhance this information, instead of “creating” information from scratch. This aligns with the findings of the ablation study, which indicate that learnable positional encodings have a more significant impact on CIFAR-10 at low bitrates and on the Kodak dataset, but a small effect on CIFAR-10 at high bitrates.

Information contained in $\mathbf{h}_{\mathbf{w}}$: To visualize the information contained in $\mathbf{h}_{\mathbf{w}}$, we again take kodim03 at 0.488 bpp as an example. We reconstruct the image using $\mathbf{h}_{\mathbf{w}}$ for this image but mask out $\mathbf{h}_{\mathbf{Z}}$ by setting it to the prior mean. The image reconstructed in this way is shown in Fig 6(b).

From the figure, we can clearly see that $\mathbf{h}_{\mathbf{w}}$ mostly captures the color specific to each patch, whereas the positional encodings contain more information about edges and shapes. Moreover, patches close to each other share similar patterns, indicating redundancy between patches. This explains why employing the hierarchical model provides substantial gains, especially when it is applied together with positional encodings.

Linear Transform $\bm{A}$: To interpret how the linear reparameterization works, we take the Kodak dataset as an example, and visualize $\bm{A}$ for the second layer (i.e., $\bm{A}^{[2]}$) at 0.074 and 0.972 bpp in Fig 6(c) and 6(d). Note that this layer has 32 hidden units and thus $\bm{A}^{[2]}$ has a shape of $1056\times 1056$. We only take a subset of $150\times 150$ in order to have a clearer visualization. Recall $\mathbf{w}=\mathbf{h}_{\mathbf{w}}\bm{A}$, and thus rows correspond to dimensions in $\mathbf{h}_{\mathbf{w}}$ and columns correspond to dimensions in $\mathbf{w}$.
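The shape $1056\times 1056$ is consistent with a layer that maps 32 units to 32 units with a bias, since $32\times 32+32=1056$. The sketch below checks this count and applies $\mathbf{w}=\mathbf{h}_{\mathbf{w}}\bm{A}$ to a random latent vector. That the layer input is also 32-dimensional, and the order in which weights and biases are unflattened, are assumptions made here for illustration.

```python
import numpy as np


def layer_param_count(fan_in, fan_out):
    """Number of weights plus biases in a dense layer."""
    return fan_in * fan_out + fan_out


fan_in, fan_out = 32, 32                  # assumed layer widths
n = layer_param_count(fan_in, fan_out)    # 32 * 32 + 32 = 1056

rng = np.random.default_rng(0)
h_w = rng.standard_normal(n)              # latent representation (row vector)
A = rng.standard_normal((n, n)) / np.sqrt(n)

# Rows of A index dimensions of h_w; columns index dimensions of w.
w = h_w @ A

# Hypothetical unflattening into a weight matrix and a bias vector.
weight = w[: fan_in * fan_out].reshape(fan_in, fan_out)
bias = w[fan_in * fan_out:]
```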

It can be seen that when the bitrate is high, many rows in $\bm{A}$ are active, enabling a flexible model. Conversely, at lower bitrates, many rows become 0, effectively pruning out the corresponding dimensions. This shows how $\bm{A}$ improves the performance. First, $\bm{A}$ greatly promotes parameter sharing. For instance, at low bitrates, merely 10 percent of the parameters are involved in constructing the entire network. Second, pruning in $\mathbf{h}_{\mathbf{w}}$ is more efficient than pruning in $\mathbf{w}$ directly. The predecessor of recombiner, i.e., combiner, utilizes standard Bayesian neural networks. It controls its bitrate by pruning or activating the hidden units. When a unit is pruned, the entire column in the weight matrix will be pruned out (Trippe & Turner, 2017). In other words, in combiner, the pruning in $\mathbf{w}$ is always conducted in chunks, which severely limits the flexibility of the network. In contrast, in our approach, the linear reparameterization enables pruning or activating each dimension in $\mathbf{h}_{\mathbf{w}}$ individually, ensuring the flexibility of the inr while effectively managing the rate.
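The effect of zero rows can be checked numerically: if row $k$ of $\bm{A}$ is zero, the $k$-th dimension of $\mathbf{h}_{\mathbf{w}}$ has no influence on $\mathbf{w}$ and can be dropped. The following toy example uses a small random matrix and a hypothetical tolerance `tol`; it is not taken from the trained models.

```python
import numpy as np


def active_rows(A, tol=1e-6):
    """Indices of the rows of A whose Euclidean norm exceeds `tol`."""
    return np.flatnonzero(np.linalg.norm(A, axis=1) > tol)


rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
A[[1, 4, 5]] = 0.0                 # emulate three pruned latent dimensions
h = rng.standard_normal(8)

keep = active_rows(A)
w_full = h @ A                     # uses every latent dimension
w_pruned = h[keep] @ A[keep]       # uses only the active ones
fraction_active = len(keep) / A.shape[0]
```

Every entry of `w_pruned` still depends on all remaining latent dimensions, which is the parameter sharing described above.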

Another interesting observation is that the matrix $\bm{A}$ essentially learns a low-rank pattern without manual tuning. This is in contrast to VC-INR (Schwarz et al., 2023), where the low-rank pattern is explicitly enforced by manually setting the LoRA-style (Hu et al., 2021) modulation.
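One way to quantify such a low-rank pattern is to count the singular values above a relative threshold. The sketch below does this for a synthetic rank-5 matrix; the threshold `rel_tol` is a hypothetical choice and the matrix is not a learned $\bm{A}$.

```python
import numpy as np


def effective_rank(A, rel_tol=1e-3):
    """Count singular values larger than `rel_tol` times the largest one."""
    s = np.linalg.svd(A, compute_uv=False)   # sorted in descending order
    return int(np.sum(s > rel_tol * s[0]))


rng = np.random.default_rng(0)
U = rng.standard_normal((50, 5))
V = rng.standard_normal((5, 50))
A = U @ V                                    # a 50 x 50 matrix of rank 5
rank = effective_rank(A)
```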

D.2 Effectiveness of Random Permutation

In this section, we provide an example illustrating the effectiveness of random permutation across patches. Specifically, we encode kodim23 at 0.074 bpp, both with and without random permutation, and visualize their residual images in Figure 8. We can see that, without permutation, the residuals for complex patches are significantly larger than those for simpler patches. This is because, in recombiner, the bits allocated to each patch are determined solely by the number of blocks, which is shared across all the patches. On the other hand, after the permutation, we can see a more balanced distribution of residuals across patches: complex patches achieve better reconstructions, whereas the performance on simple patches degrades only marginally. This is because, after the permutation across patches, each block can contain representations from different patches, enabling an adaptive allocation of bits across patches. Overall, random permutation yields a 1.00 dB gain on this image.

(a) with permutation, PSNR 29.16 dB
(b) without permutation, PSNR 28.16 dB
Figure 8: Comparison of residuals of kodim23 at 0.074 bpp, with and without random permutation across patches.
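A comparison of this kind relies on two quantities: the overall PSNR and the error within each patch. The sketch below computes both for a toy image whose error is concentrated in one patch. The patch size and the assumption that pixel values lie in $[0,1]$ are illustrative choices, not the settings used for Figure 8.

```python
import numpy as np


def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)


def per_patch_mse(x, y, patch):
    """Mean squared error of each non-overlapping patch of an (H, W, C) image."""
    H, W = x.shape[:2]
    err = (x - y) ** 2
    err = err.reshape(H // patch, patch, W // patch, patch, -1)
    return err.mean(axis=(1, 3, 4))


# Toy example: an 8x8 RGB image whose error is confined to one 4x4 patch.
x = np.zeros((8, 8, 3))
y = x.copy()
y[:4, :4] += 0.1
patch_errors = per_patch_mse(x, y, patch=4)   # shape (2, 2)
overall_psnr = psnr(x, y)
```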

D.3 Influence of Sample Size

As discussed in Section C.1, in our experiments, we use 5 samples to estimate the expectation in the $\beta$-ELBO in Equation 1 when inferring the posterior of a test datum. Here, we provide the RD curves using 1, 5, and 10 samples, on 500 randomly selected CIFAR-10 test images and on kodim03, to illustrate the influence of the sample size.

As shown in Figure 9, the sample size mainly impacts the performance at high bitrates. Besides, further increasing the sample size to 10 only brings a minor improvement. Therefore, we choose 5 samples in our experiments to balance between encoding time and performance. It is also worth noting that using just 1 sample does not significantly reduce the performance. Therefore, we have the flexibility of choosing smaller sample sizes when prioritizing encoding time, with minor performance impacts.
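The role of the sample size can be seen in a generic Monte Carlo estimator of a $\beta$-weighted objective. The sketch below is written as a loss (expected distortion plus $\beta$ times the KL divergence) for diagonal Gaussians; Equation 1 is not reproduced in this appendix, so the sign convention, the toy distortion function, and all names are assumptions.

```python
import numpy as np


def gaussian_kl(q_mean, q_std, p_mean, p_std):
    """KL[q || p] between diagonal Gaussians, summed over dimensions (nats)."""
    return np.sum(
        np.log(p_std / q_std)
        + (q_std ** 2 + (q_mean - p_mean) ** 2) / (2.0 * p_std ** 2)
        - 0.5
    )


def beta_objective(distortion_fn, q_mean, q_std, p_mean, p_std,
                   beta, num_samples, rng):
    """Estimate E_q[distortion] with `num_samples` draws, then add beta * KL."""
    eps = rng.standard_normal((num_samples,) + q_mean.shape)
    samples = q_mean + q_std * eps          # reparameterized samples from q
    expected_distortion = np.mean([distortion_fn(w) for w in samples])
    return expected_distortion + beta * gaussian_kl(q_mean, q_std, p_mean, p_std)


rng = np.random.default_rng(0)
target = np.array([0.5, -0.2, 0.1, 0.0])


def toy_distortion(w):
    # Hypothetical stand-in for the reconstruction error of a decoded signal.
    return float(np.sum((w - target) ** 2))


q_mean, q_std = np.full(4, 0.5), np.full(4, 0.1)
p_mean, p_std = np.zeros(4), np.ones(4)
loss = beta_objective(toy_distortion, q_mean, q_std, p_mean, p_std,
                      beta=0.01, num_samples=5, rng=rng)
```

Only the expectation term is affected by `num_samples`; the KL term is computed in closed form.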

(a) CIFAR-10
(b) kodim03
Figure 9: Influence of sample size. (a) RD curve evaluated on 500 randomly selected CIFAR-10 images. (b) RD curve evaluated on kodim03.

D.4 Robustness during Training

Unlike previous INR-based codecs based on MAML (Finn et al., 2017), including COIN++ (Dupont et al., 2022), MSCN (Schwarz & Teh, 2022) and VC-INR (Schwarz et al., 2023), our proposed recombiner does not require nested gradient descent and is thus more stable during training.

To demonstrate this advantage, we present a visualization of the average $\beta$-ELBO during training on CIFAR-10 across three bitrates in Figure 10. We can see that the training curves exhibit an initial dip followed by a consistent increase. The dip at the beginning is a result of our adjustment of $\beta$ during training (Step 3 in Algorithm 1). Importantly, this adjustment does not impact training robustness: $\beta$ is quickly adjusted, and the training then proceeds smoothly.
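To make the adjustment concrete, one simple possibility is a multiplicative rule that raises $\beta$ when the KL exceeds the coding budget and lowers it when the KL falls short. Algorithm 1 is not reproduced in this appendix, so the rule below, including its tolerance and step size, is a hypothetical illustration and may differ from the actual Step 3.

```python
def adjust_beta(beta, kl, budget, tolerance=0.05, rate=0.05):
    """Nudge beta so that the KL moves toward the coding budget.

    All arguments besides `beta` are illustrative: `kl` and `budget` are in
    the same units (e.g. bits), `tolerance` is a relative dead zone around
    the budget, and `rate` is the multiplicative step size.
    """
    if kl > budget * (1.0 + tolerance):
        return beta * (1.0 + rate)   # over budget: penalize the KL more
    if kl < budget * (1.0 - tolerance):
        return beta / (1.0 + rate)   # under budget: relax the penalty
    return beta                      # within tolerance: leave beta unchanged


beta_over = adjust_beta(1.0, kl=20.0, budget=10.0)
beta_under = adjust_beta(1.0, kl=5.0, budget=10.0)
beta_same = adjust_beta(1.0, kl=10.0, budget=10.0)
```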

(a) $\beta$-ELBO w.r.t. training steps.
(b) Zoom-in plot.
Figure 10: Average training $\beta$-ELBO on CIFAR-10 at three different bitrates. The initial dip is because we also adjust $\beta$ during training to ensure the coding budget (Step 3 in Algorithm 1). We can see the initial $\beta$ quickly adjusts in the first several steps, and then the training proceeds smoothly.

D.5 Coding Time

In this section, we provide details regarding the encoding and decoding time of recombiner. The encoding speed is measured on a single NVIDIA A100-SXM-80GB GPU. On CIFAR-10 and protein structures, we compress signals in batches, with a batch size of 500 images and 1,000 structures, respectively. On the Kodak, audio, and video datasets, we compress each signal separately. Note that the batch size does not influence the results: the posteriors of signals within one batch are optimized in parallel, and their gradients do not interact. The decoding speed is measured per signal on CPU.

Similar to COMBINER, our approach features a high encoding time complexity. However, the decoding process is remarkably fast, even on CPU, matching the speed of COIN and COMBINER. Note that the decoding time listed here encompasses the retrieval of samples for each block. In practical applications, this process can be implemented and parallelized using lower-level languages such as C++ or C, which can lead to further acceleration of execution.

| Bitrate | Encoding Time (GPU, 500 instances) | Decoding Time (CPU, per instance) |
| --- | --- | --- |
| 0.297 bpp | ~63 min | 0.00386 s |
| 0.719 bpp | ~65 min | 0.00429 s |
| 0.938 bpp | ~68 min | 0.00461 s |
| 1.531 bpp | ~72 min | 0.00514 s |
| 1.922 bpp | ~75 min | 0.00581 s |
| 3.344 bpp | ~87 min | 0.00776 s |
| 4.391 bpp | ~93 min | 0.01050 s |
Table 3: Coding time for CIFAR-10.

| Bitrate | Encoding Time (GPU, per instance, 96 patches) | Decoding Time (CPU, per instance) |
| --- | --- | --- |
| 0.074 bpp | ~59 min | 0.25848 s |
| 0.130 bpp | ~64 min | 0.29117 s |
| 0.178 bpp | ~67 min | 0.30875 s |
| 0.316 bpp | ~72 min | 0.29690 s |
| 0.488 bpp | ~80 min | 0.34237 s |
| 0.972 bpp | ~92 min | 0.41861 s |
Table 4: Coding time for Kodak.

| Bitrate | Encoding Time (GPU, per instance, 50 patches) | Decoding Time (CPU, per instance) |
| --- | --- | --- |
| 5.69 kbps | ~18 min | 0.05564 s |
| 10.66 kbps | ~21 min | 0.06003 s |
| 22.11 kbps | ~22 min | 0.06166 s |
| 43.64 kbps | ~22 min | 0.07350 s |
Table 5: Coding time for Audio.

| Bitrate | Encoding Time (GPU, per instance, 64 patches) | Decoding Time (CPU, per instance) |
| --- | --- | --- |
| 0.115 bpp | ~49 min | 0.31936 s |
| 0.244 bpp | ~62 min | 0.33416 s |
| 0.605 bpp | ~78 min | 0.33448 s |
| 1.183 bpp | ~102 min | 0.35665 s |
Table 6: Coding time for Video.

| Bitrate | Encoding Time (GPU, 1,000 instances) | Decoding Time (CPU, per instance) |
| --- | --- | --- |
| 11.17 bpa | ~72 min | 0.00704 s |
| 35.17 bpa | ~123 min | 0.00948 s |
| 60.67 bpa | ~175 min | 0.01429 s |
| 83.83 bpa | ~226 min | 0.01778 s |
| 106.17 bpa | ~274 min | 0.02014 s |
Table 7: Coding time for Protein.
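Since the CIFAR-10 and protein encoding times are reported per batch, it can be useful to convert them to an amortized time per instance. The small sketch below does this for the lowest-bitrate rows of the CIFAR-10 and protein tables; the function name is ours, and the result is an average under batched execution, not the latency of encoding a single signal on its own.

```python
def amortized_seconds(batch_minutes, batch_size):
    """Average encoding time per instance, in seconds, for a batched run."""
    return batch_minutes * 60.0 / batch_size


cifar_lowest = amortized_seconds(63, 500)      # 0.297 bpp row: 7.56 s per image
protein_lowest = amortized_seconds(72, 1000)   # 11.17 bpa row: 4.32 s per structure
```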

Appendix E Things We Tried that Did Not Work

  • In recombiner, we apply linear reparameterization on inr weights, which transfers the weights linearly into a transformed space. Perhaps a natural extension is to apply more complex transformations, e.g., neural networks or flows. We experimented with this idea, but it did not provide gains over the linear transformation.

  • In recombiner, we propose a hierarchical Bayesian model, equivalent to assigning hierarchical hyper-priors and inferring the hierarchical posteriors over the means of the inr weights. A natural extension is to assign hyper-priors/posteriors to both means and variances, but we did not observe any gain from this.

  • In recombiner, the hierarchical Bayesian model is only applied to the latent inr weights $\mathbf{h}_{\mathbf{w}}$. It is natural to apply the same hierarchical structure to the latent positional encodings $\mathbf{h}_{\mathbf{z}}$. However, we found that it does not provide a visible gain.

Appendix F More RD Curves

Here, we show the full-resolution RD curves for image compression in Figures 11 and 12. We also provide a further comparison between recombiner and combiner on 24 test audio clips from LibriSpeech in Figure 13.

Figure 11: RD curve on CIFAR-10.
Figure 12: RD curve on Kodak.
Figure 13: RD curve of MP3, combiner and recombiner on 24 test audio clips from LibriSpeech test set.

Appendix G RD Values

CIFAR-10:

rate = [0.297, 0.719, 0.938, 1.531, 1.922, 3.344, 4.391]

PSNR = [23.592, 27.222, 28.505, 30.911, 32.168, 35.732, 38.139]

Kodak:

rate = [0.074, 0.130, 0.178, 0.316, 0.488, 0.972, 1.567, 3.320]

PSNR = [26.158, 27.653, 28.594, 30.439, 31.953, 34.540, 36.547, 40.426]

Audio:

On full test set:

rate = [5.685, 10.661, 22.112, 43.637]

PSNR = [42.612, 47.101, 52.196, 58.195]

On 24 test examples (to compare with COMBINER):

rate = [5.168, 10.805, 22.112, 43.637]

PSNR = [42.789, 47.106, 52.206, 58.327]

Video:

rate = [0.115, 0.244, 0.605, 1.183]

PSNR = [28.722, 31.494, 35.717, 39.171]

Protein:

rate = [11.17, 35.17, 60.67, 83.83, 106.17]

RMSD = [0.9242, 0.1388, 0.0709, 0.0506, 0.0436]
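For comparison against other codecs at matching bitrates, the values above can be loaded and linearly interpolated. The sketch below does this for the two image datasets using `numpy.interp`. Linear interpolation between measured points is our simplification, and `numpy.interp` clamps to the end values outside the measured range, so queries should stay within it.

```python
import numpy as np

# RD values copied from the lists above (rate in bpp, PSNR in dB).
RD = {
    "CIFAR-10": (
        [0.297, 0.719, 0.938, 1.531, 1.922, 3.344, 4.391],
        [23.592, 27.222, 28.505, 30.911, 32.168, 35.732, 38.139],
    ),
    "Kodak": (
        [0.074, 0.130, 0.178, 0.316, 0.488, 0.972, 1.567, 3.320],
        [26.158, 27.653, 28.594, 30.439, 31.953, 34.540, 36.547, 40.426],
    ),
}


def psnr_at(dataset, rate):
    """Linearly interpolate the PSNR of `dataset` at the given rate."""
    rates, psnrs = RD[dataset]
    return float(np.interp(rate, rates, psnrs))


cifar_at_1bpp = psnr_at("CIFAR-10", 1.0)   # lies between the 0.938 and 1.531 bpp points
kodak_at_knot = psnr_at("Kodak", 0.488)    # a measured point
```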

Appendix H More Decoded Examples

H.1 CIFAR-10

Figure 14: Decoded CIFAR-10 images and residuals.

H.2 Kodak

(a) Decoded images and residuals of kodim01.
(b) Decoded images and residuals of kodim23.
Figure 15: Examples of decoded Kodak images and their residuals.

H.3 Audio

| Decoded audio (5.17 kbps, 46.78 dB) | Decoded audio (10.81 kbps, 51.53 dB) | Decoded audio (22.11 kbps, 56.45 dB) | Ground truth |
| --- | --- | --- | --- |
| here | here | here | here |
Table 8: Decoded audio examples.

H.4 Video

Figure 16: Examples of decoded videos and residuals. Animation visualization is available here.

H.5 Protein Structure

(a) Example 1. A 3D view is available here.
(b) Example 2. A 3D view is available here.
Figure 17: Examples of decoded protein structures and their ground truths.