1 Shanghai Jiao Tong University    2 Monash University
Email: {yhchen.ee, wylin}@sjtu.edu.cn,
{qianyi.wu, jianfei.cai, mehrtash.harandi}@monash.edu

HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression

Yihang Chen 1,2    Qianyi Wu 2    Jianfei Cai 2    Mehrtash Harandi 2    Weiyao Lin 1
Abstract

3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the large number of Gaussians and their associated attributes necessitates effective compression techniques, and the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) makes compression challenging. To address this, we make use of the relations between the unorganized anchors and the structured hash grid, leveraging their mutual information for context modeling, and propose a Hash-grid Assisted Context (HAC) framework for highly compact 3DGS representation. Our approach introduces a binary hash grid to establish continuous spatial consistencies, allowing us to unveil the inherent spatial relations of anchors through a carefully designed context model. To facilitate entropy coding, we utilize Gaussian distributions to accurately estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Additionally, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Importantly, our work is the first to explore context-based compression for 3DGS representation, resulting in a remarkable size reduction of over 75× compared to vanilla 3DGS, while simultaneously improving fidelity, and achieving over 11× size reduction over the SOTA 3DGS compression approach Scaffold-GS. Our code is available here.


Keywords: 3D Gaussian Splatting, Compression, Context Models
Figure 1: Top: A toy example where our method makes the vanilla 3D Gaussian Splatting (3DGS) model 72× smaller (or 9.49× smaller compared to the SoTA Scaffold-GS [25]), with similar or better fidelity. Bottom: Most existing 3DGS compression methods concentrate solely on parameter “values”, using pruning or vector quantization to reduce the size and ignoring the structural relations among nearby 3D Gaussians. Scaffold-GS [25] introduces anchors to cluster and neural-predict the associated 3D Gaussians while treating each anchor point independently. Our core idea is to further exploit the inherent consistencies of anchors via a structured hash grid for a more compact 3DGS representation.

1 Introduction

Over the past few years, significant advancements have been made in 3D scene representations for novel view synthesis. Neural Radiance Field (NeRF) [26] proposes rendering colors by accumulating RGB values along sampling rays using an implicit Multilayer Perceptron (MLP), aiming at reconstructing photo-realistic images. However, the extensive sampling of ray points has been a bottleneck, affecting both the speed of training and rendering. Recent advances of NeRF [28, 5, 11] introduce feature grids to enhance the rendering process, facilitating faster rendering speeds by reducing the MLP size. Despite the improvement, these approaches still suffer from relatively slow rendering speeds due to frequent ray point sampling.

In this context, very recently, a new paradigm of 3D representation, 3D Gaussian Splatting (3DGS) [17], emerged. 3DGS introduces learnable Gaussians to directly represent 3D space explicitly. These Gaussians, initialized from Structure-from-Motion (SfM) [33] and endowed with learnable shape and appearance parameters, can be directly splatted onto 2D planes for rapid and differentiable rendering within imperceptible intervals using tile-based rasterization [19]. As such, the time-consuming volume rendering used in NeRF can be completely removed. The advantages of rapid differentiable rendering with high photo-realistic fidelity have stimulated the fast and widespread adoption of 3DGS in the field.

However, 3DGS is not the ultimate solution. One major drawback is that it requires a considerable number of 3D Gaussians to represent a large-scale scene well (e.g., millions of Gaussians for city-scale scenes) and needs a large storage space (e.g., a few gigabytes (GB)) to store the associated Gaussian attributes for each scene [40]. This motivates us to investigate effective compression techniques for 3DGS.

Due to their sparse and unorganized nature, 3D Gaussians are challenging to compress. Therefore, most existing 3DGS compression approaches focus solely on parameter “values” but overlook their structural relations. For example, as illustrated in Fig. 1 middle, parameter pruning can be used to mask out the Gaussians whose parameter values are below a certain threshold [20, 10]. Another straightforward technique is to apply vector quantization to cluster parameters with similar “values”. Such an approach enables the direct compression of parameters by retaining only the more representative ones while maintaining reconstruction fidelity [20, 30, 29, 10]. Nevertheless, solely concentrating on “values” fails to eliminate structural redundancies, whose removal is pivotal for compact representations. To exploit such spatial relations of Gaussians, Scaffold-GS [25] introduces anchors to cluster related nearby 3D Gaussians and neural-predict their attributes from the anchors’ attributes, resulting in significant storage savings. Despite the improvement, Scaffold-GS still treats each anchor independently, and a substantial number of anchors remain that are sparse, unorganized, and hard to compress, due to their point-cloud nature.

To further push the boundary of 3DGS compression, we draw inspiration from the NeRF series [26], contemplating the idea of representing 3D space using well-organized feature grids [28, 5]. We pose the question: Are there inherent relations between the attributes of unorganized anchors in Scaffold-GS and the structured feature grids? Our answer is affirmative since we observe large mutual information between anchor attributes and the hash grid features. Based on this observation, we propose a Hash-grid Assisted Context (HAC) framework, where our core idea is to jointly learn a structured, compact hash grid (binarized for each hash parameter) and use it for context modeling of anchor attributes. Specifically, with Scaffold-GS [25] as our base model, for each anchor, we query the hash grid by the anchor location to obtain an interpolated hash feature, which is then used to predict the value distributions of anchor attributes, facilitating entropy coding such as Arithmetic Coding (AE) [39] for a highly compact representation of the model. Note that we employ Scaffold-GS as our base model as its anchor-centered design provides a good foundation to establish relations with these interpolated hash features. Furthermore, we introduce an Adaptive Quantization Module (AQM), which dynamically adjusts the quantization step sizes for different anchor attributes to retain their original information. Learnable masks are also employed to mask out invalid Gaussians and anchors, further enhancing the compression ratio. Our main contributions can be summarized as follows:

  1. To our knowledge, we are the first to model contexts for 3DGS compression, i.e., using a structured hash grid to exploit the inherent consistencies among unorganized 3D Gaussians (or anchors in Scaffold-GS).

  2. To facilitate efficient entropy encoding of anchor attributes, we propose to use the interpolated hash feature to neural-predict the value distribution of anchor attributes as well as the quantization step refinement used by AQM. We also employ learnable masks to prune out ineffective Gaussians and anchors.

  3. Extensive experiments on five datasets demonstrate the effectiveness of our HAC framework and each technical component. We achieve a compression ratio of 11× over our base model Scaffold-GS and 75× over the vanilla 3DGS model when averaged over all datasets, with comparable or even improved fidelity.

2 Related Work

Neural Radiance Field and its compression. The emergence of Neural Radiance Field (NeRF) [26] has significantly advanced novel view synthesis by employing a single learnable implicit MLP to generate arbitrary views of 3D scenes through $\alpha$-composed accumulation of RGB values along a ray. However, the dense querying of sampling points and the utilization of a large MLP hinder real-time rendering. To address this problem, subsequent approaches such as Instant-NGP [28], TensoRF [5], K-planes [11], and DVGO [37] adopt explicit grid-based representations to facilitate faster training and rendering by reducing the size of the MLP, which however comes at the cost of increased storage space.

To mitigate the storage increase, compression techniques focusing on reducing the size of explicit representations have been developed, which can be categorized into either “value”-based or structural-relation-based approaches. The former category includes pruning [23, 9], codebooks [23, 24], and methods like quantization or entropy constraint employed in BiRF [35] and SHACIRA [13]. On the other hand, the latter category explores structural relations via wavelet decomposition [32], rank-residual decomposition [38], or spatial prediction [36] to eliminate spatial redundancy, thanks to the well-structured characteristics of these feature grids. CNC [6] provides a solid proof of concept by sufficiently utilizing such structural information, achieving remarkable RD performance gain.

3D Gaussian Splatting and its compression. 3DGS [17] has innovatively addressed the challenge of slow training and rendering in NeRF while maintaining high-fidelity quality by representing 3D scenes with 3D Gaussians endowed with learnable shape and appearance attributes. By adopting differentiable splatting and tile-based rasterization [19], 3D Gaussians are optimized during training to best fit their local 3D regions. Despite its advantages, the large number of Gaussians and their associated attributes necessitates effective compression techniques.

Unlike NeRF-based feature grids, 3D Gaussians in 3DGS are sparse and unorganized, presenting significant challenges for establishing structural relations. Consequently, compression approaches have focused primarily on the “value” of model parameters, employing techniques such as pruning [20, 10], codebooks [20, 30, 10, 29], and entropy constraints [12]. To our knowledge, only Scaffold-GS [25] and Morgenstern et al. [27] have explored the relations of Gaussians. In [25], the authors introduce anchor-centered features to reduce the number of parameters, while in [27] dimension collapsing is considered to compress Gaussians in an ordered 2D space. However, their investigation of spatial redundancy remains insufficient.

In this paper, we emphasize that leveraging such structural relations is crucial for compression. For instance, approaches in image compression [7, 15, 14] and video compression [21, 22, 34] have demonstrated the effectiveness of eliminating structural redundancy by excavating spatial and temporal relations, thanks to their well-organized data structure. Motivated by this, with Scaffold-GS as our base model, we introduce a well-structured hash grid as context to model the inherent consistencies of the sparse and unorganized anchors, achieving a much more compact 3DGS representation.

3 Methods

In Fig. 2, we conceptualize our HAC framework. In particular, HAC is based on the baseline Scaffold-GS [25] (Fig. 2 top), which introduces anchors with their attributes $\mathcal{A}$ (feature, scaling, and offsets) to cluster and neural-predict 3D Gaussian attributes (opacity, RGB, scale, and quaternion). At the core of our HAC, we propose to jointly learn a structured, compact hash grid (binarized for each parameter) that can be queried at any anchor location to obtain the interpolated hash feature $\bm{f}^h$ (Fig. 2 middle). Instead of directly substituting the anchor feature, $\bm{f}^h$ is used as context to predict the value distributions of anchor attributes, which is essential for the subsequent entropy coding, i.e., Arithmetic Coding (AE). Our context model (Fig. 2 bottom) is a simple MLP that takes $\bm{f}^h$ as input and outputs $\bm{r}$ for the adaptive quantization module (AQM), which quantizes anchor attribute values into a finite set, and the Gaussian parameters ($\bm{\mu}$ and $\bm{\sigma}$) for modeling the value distributions of anchor attributes, from which we can compute the probability of each quantized attribute value for AE. Note that we draw two MLPs (${\rm MLP_q}$ and ${\rm MLP_c}$) in Fig. 2 for ease of explanation, but they actually share the same MLP layers with outputs at different dimensions. Besides, an adaptive offset masking module (Fig. 2 top-left) is adopted to prune redundant Gaussians and anchors. In the following, we first introduce the background and then delve into the detailed technical components of our HAC.

Figure 2: Overview of our HAC framework. It is based on Scaffold-GS [25] (top), which introduces anchors with their attributes to neural-predict 3D Gaussian attributes. Middle: Our HAC framework jointly learns a structured, compact hash grid (binarized for each parameter) that can be queried at any anchor location to obtain the interpolated hash feature $\bm{f}^h$. Instead of direct substitution, $\bm{f}^h$ is used as context to predict the value distributions of anchor attributes, which is essential for the subsequent entropy coding. Bottom: Our proposed context models take $\bm{f}^h$ as input and output $\bm{r}$ for the AQM (which quantizes anchor attribute values into a finite set) and the parameters ($\bm{\mu}$ and $\bm{\sigma}$) to model the value distributions of anchor attributes.

3.1 Preliminaries

3D Gaussian Splatting (3DGS) [17] represents a 3D scene using numerous Gaussians and renders viewpoints through differentiable splatting and tile-based rasterization. Each Gaussian is initialized from SfM and defined by a 3D covariance matrix $\bm{\Sigma} \in \mathbb{R}^{3\times 3}$ and location (mean) $\bm{\mu} \in \mathbb{R}^{3}$,

$$G(\bm{x}) = \exp\left(-\frac{1}{2}(\bm{x}-\bm{\mu})^{\top}\bm{\Sigma}^{-1}(\bm{x}-\bm{\mu})\right), \tag{1}$$

where $\bm{x} \in \mathbb{R}^{3}$ is a random 3D point, and $\bm{\Sigma}$ is defined by a diagonal matrix $\bm{S} \in \mathbb{R}^{3\times 3}$ representing scaling and a rotation matrix $\bm{R} \in \mathbb{R}^{3\times 3}$ to guarantee its positive semi-definiteness, such that $\bm{\Sigma} = \bm{R}\bm{S}\bm{S}^{\top}\bm{R}^{\top}$. To render an image from a random viewpoint, 3D Gaussians are first splatted to 2D, and the pixel value $\bm{C} \in \mathbb{R}^{3}$ is then rendered using $\alpha$-composed blending,

$$\bm{C} = \sum_{i \in I} \bm{c}_i \alpha_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right) \tag{2}$$

where $\alpha \in \mathbb{R}^{1}$ measures the opacity of each Gaussian after 2D projection, $\bm{c} \in \mathbb{R}^{3}$ is the view-dependent color modeled by Spherical Harmonic (SH) coefficients, and $I$ is the set of sorted Gaussians contributing to the rendering.
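As an illustration of the $\alpha$-composed blending in Eq. (2), the following minimal sketch accumulates the color of one pixel from depth-sorted Gaussians. It is not the tile-based rasterizer of 3DGS; the function name `alpha_blend` and the toy inputs are assumptions made for this example, and the opacities are taken as already projected to 2D.

```python
from typing import List, Sequence


def alpha_blend(colors: Sequence[Sequence[float]], alphas: Sequence[float]) -> List[float]:
    """Blend depth-sorted RGB colors front to back, following Eq. (2)."""
    pixel = [0.0, 0.0, 0.0]
    # Running product prod_{j<i} (1 - alpha_j); it equals 1 for the first Gaussian.
    transmittance = 1.0
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
    return pixel


# A half-transparent red Gaussian in front of a half-transparent green one.
print(alpha_blend([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5]))  # [0.5, 0.25, 0.0]
```

The second Gaussian contributes only a quarter of its color because half of the light is already absorbed by the first one.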

Scaffold-GS [25] adheres to the framework of 3DGS and introduces a more storage-friendly and fidelity-satisfying anchor-based approach. It utilizes anchors to cluster Gaussians and deduce their attributes from the attributes of attached anchors through MLPs, rather than directly storing them. Specifically, each anchor consists of a location $\bm{x}^a \in \mathbb{R}^{3}$ and anchor attributes $\mathcal{A} = \{\bm{f}^a \in \mathbb{R}^{D^a}, \bm{l} \in \mathbb{R}^{6}, \bm{o} \in \mathbb{R}^{3K}\}$, where the components represent the anchor feature, scaling, and offsets, respectively. During rendering, $\bm{f}^a$ is fed into MLPs to generate attributes for Gaussians, whose locations are determined by adding $\bm{x}^a$ and $\bm{o}$, where $\bm{l}$ is utilized to regularize both locations and shapes of the Gaussians. While Scaffold-GS has demonstrated effectiveness via this anchor-centered design, we contend that significant redundancy remains among the anchors, whose inherent consistencies we can exploit for a more compact 3DGS representation.
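To make the anchor layout concrete, the sketch below stores one anchor with its attributes $\mathcal{A} = \{\bm{f}^a, \bm{l}, \bm{o}\}$ and derives the $K$ Gaussian centers by adding each offset to $\bm{x}^a$. The class and field names are hypothetical, and the sketch deliberately omits both the attribute-predicting MLPs and the regularization by $\bm{l}$, so it should be read as a data-layout illustration rather than as the Scaffold-GS implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Anchor:
    """One anchor: location x^a plus attributes A = {f^a, l, o}."""
    location: Vec3        # x^a in R^3
    feature: List[float]  # f^a in R^{D^a}
    scaling: List[float]  # l in R^6
    offsets: List[float]  # o in R^{3K}, K offsets stored back to back

    def gaussian_locations(self) -> List[Vec3]:
        """Return the K Gaussian centers x^a + o_k (regularization by l omitted)."""
        if len(self.scaling) != 6 or len(self.offsets) % 3 != 0:
            raise ValueError("expected l in R^6 and o in R^{3K}")
        x, y, z = self.location
        o = self.offsets
        return [(x + o[k], y + o[k + 1], z + o[k + 2]) for k in range(0, len(o), 3)]


anchor = Anchor(location=(1.0, 2.0, 3.0), feature=[0.0] * 4,
                scaling=[1.0] * 6, offsets=[0.5, 0.0, 0.0, 0.0, -1.0, 0.0])
print(anchor.gaussian_locations())  # [(1.5, 2.0, 3.0), (1.0, 1.0, 3.0)]
```

Per anchor, only $\bm{x}^a$ and the $D^a + 6 + 3K$ attribute values are stored, which is why the compression of $\mathcal{A}$ dominates the model size.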

3.2 Bridging Anchors and Hash Grid

Our main idea is to leverage the well-structured hash grid to unveil the inherent spatial consistencies of the unorganized anchors. To verify their mutual information, we first explore substituting anchor features $\bm{f}^a$ with hash features $\bm{f}^h$ that are acquired by interpolation using the anchor location $\bm{x}^a$ on the hash grid $\mathcal{H}$, defined as $\bm{f}^h := {\rm Interp}(\bm{x}^a, \mathcal{H})$. Here, $\mathcal{H} = \{\bm{\theta}_i^l \in \mathbb{R}^{D^h} \mid i = 1, \dots, T^l \mid l = 1, \dots, L\}$ represents the hash grid, where $D^h$ is the dimension of vector $\bm{\theta}_i^l$, $T^l$ is the table size of the grid for level $l$, and $L$ is the total number of levels.
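The query $\bm{f}^h := {\rm Interp}(\bm{x}^a, \mathcal{H})$ can be sketched as a multi-level lookup with trilinear interpolation. The code below is a simplified illustration under several assumptions that the text does not fix: anchor locations are normalized to $[0,1]^3$, every level uses the XOR-based spatial hash of Instant-NGP [28], and the per-level resolutions, table size, and random binarized parameters in $\{-1, +1\}$ are toy values.

```python
import itertools
import math
import random
from typing import List, Sequence

PRIMES = (1, 2654435761, 805459861)  # spatial-hash primes from Instant-NGP


def hash_index(corner: Sequence[int], table_size: int) -> int:
    """Map an integer grid corner to a slot of a level's hash table."""
    h = 0
    for coord, prime in zip(corner, PRIMES):
        h ^= coord * prime
    return h % table_size


def interp(x: Sequence[float], tables: List[List[List[float]]],
           resolutions: Sequence[int]) -> List[float]:
    """Trilinearly interpolate every level at x and concatenate the results.

    tables[l] holds T^l vectors of dimension D^h; x is assumed to lie in [0, 1]^3.
    """
    feature: List[float] = []
    for table, res in zip(tables, resolutions):
        scaled = [xi * res for xi in x]
        base = [math.floor(s) for s in scaled]
        frac = [s - b for s, b in zip(scaled, base)]
        level = [0.0] * len(table[0])
        for offset in itertools.product((0, 1), repeat=3):  # 8 cell corners
            weight = 1.0
            for f, o in zip(frac, offset):
                weight *= f if o else 1.0 - f
            corner = [b + o for b, o in zip(base, offset)]
            theta = table[hash_index(corner, len(table))]
            for d, value in enumerate(theta):
                level[d] += weight * value
        feature.extend(level)
    return feature


random.seed(0)
resolutions = [4, 8]  # L = 2 levels
tables = [[[random.choice((-1.0, 1.0)) for _ in range(2)]  # D^h = 2
           for _ in range(16)]                             # T^l = 16
          for _ in resolutions]
f_h = interp((0.3, 0.6, 0.9), tables, resolutions)
print(len(f_h))  # 4, i.e. L * D^h
```

Because the interpolation weights of each level sum to one, nearby anchors receive similar features even though the stored parameters are binary, which is the kind of continuous spatial consistency the hash grid is meant to provide.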
We conduct a preliminary experiment on the Synthetic-NeRF dataset [26] to assess the performance of this substitution, as shown in the right panel of Fig. 3. Direct substitution using hash features yields inferior fidelity and introduces drawbacks such as unstable training (due to its impact on anchor spawning processes) and decreased testing FPS (owing to the extra interpolation operation). These results may further degrade if $\bm{l}$ and $\bm{o}$ are also substituted for a more compact model. Nonetheless, we find that the fidelity degradation remains moderate, suggesting the existence of rich mutual information between $\bm{f}^h$ and $\bm{f}^a$. This prompts us to ask: Can we exploit such mutual relation and use the compact hash features to model the context of anchor attributes $\mathcal{A}$? This leads to the context modeling as a conditional probability:

$$p(\mathcal{A}, \bm{x}^a, \mathcal{H}) = p(\mathcal{A} \mid \bm{x}^a, \mathcal{H}) \times p(\bm{x}^a, \mathcal{H}) \sim p(\mathcal{A} \mid \bm{f}^h) \times p(\mathcal{H}) \tag{3}$$

where $\bm{x}^a$ is omitted in the last term as we assume the independence of $\bm{x}^a$ and $\mathcal{H}$ (it can be anywhere), making $p(\mathcal{H} \mid \bm{x}^a) \sim p(\mathcal{H})$, and we do not apply entropy constraints to $\bm{x}^a$. According to information theory [8], a higher probability corresponds to lower uncertainty (entropy) and lower bit consumption. Thus, the large mutual information between $\mathcal{A}$ and $\bm{f}^h$ ensures a large $p(\mathcal{A} \mid \bm{f}^h)$. Our goal is to devise a solution to effectively leverage this relationship. Furthermore, $p(\mathcal{H})$ signifies that the size of the hash grid itself should also be compressed, which can be done by adopting the existing solution for Instant-NGP compression [6].
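The link between probability and bit consumption can be made concrete: an ideal entropy coder spends about $-\log_2 p$ bits on a symbol to which the model assigns probability $p$. The toy probabilities below are invented for illustration only; they merely show how a context that raises $p(\mathcal{A} \mid \bm{f}^h)$ lowers the total code length.

```python
import math
from typing import Iterable


def code_length_bits(probabilities: Iterable[float]) -> float:
    """Ideal total code length, in bits, of symbols with the given probabilities."""
    return sum(-math.log2(p) for p in probabilities)


# Four quantized attribute values, coded without and with an informative context.
weak_context = [0.25, 0.25, 0.25, 0.25]
strong_context = [0.5, 0.5, 0.5, 0.5]
print(code_length_bits(weak_context))    # 8.0
print(code_length_bits(strong_context))  # 4.0
```

Doubling the predicted probability of every value saves one bit per value, which is why the accuracy of the context model directly determines the compressed size.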

We underscore the significance of this conditional-probability-based approach: it leaves both the rendering speed and the fidelity upper bound unaffected, since it only utilizes hash features to estimate the entropy of anchor attributes for entropy coding and does not modify the original Scaffold-GS structure. In the following subsections, we delve into the technical details of our context models.

| Synthetic-NeRF [26] | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- |
| 3DGS [17] | 33.80 | 0.970 | 0.031 |
| Scaffold-GS [25] | 33.41 | 0.966 | 0.035 |
| Substituting $\bm{f}^a$ with $\bm{f}^h$ | 32.85 | 0.963 | 0.041 |

Figure 3: Left chart: Statistical analysis of the value distributions of $\mathcal{A}$ on the scene “chair” of the Synthetic-NeRF dataset [26]. All three components $\{\bm{f}^a, \bm{l}, \bm{o}\}$ exhibit statistical Gaussian distributions. Note that the values of $\bm{l}$ are scaled by a factor of 100 for better visualization. Right table: Experimental results of directly substituting anchor feature $\bm{f}^a$ with hash feature $\bm{f}^h$ on this dataset.

3.3 HAC: Hash-Grid Assisted Context Framework

The principal objective of HAC is to minimize the entropy of anchor attributes $\mathcal{A}$ with the assistance of hash feature $\bm{f}^h$ (i.e., maximize $p(\mathcal{A} \mid \bm{f}^h)$), facilitating bit reduction when encoding anchor attributes using entropy coding like AE [39]. As shown in Fig. 2, anchor locations $\bm{x}^a$ are first fed into the hash grid for interpolation, and the obtained $\bm{f}^h$ are then employed as context for $\mathcal{A}$.

Adaptive Quantization Module. To facilitate entropy coding, values of $\mathcal{A}$ must be quantized to a finite set. Our empirical studies reveal that binarization, as in BiRF [35], is unsuitable for $\mathcal{A}$ as it fails to preserve sufficient information. Thus, we opt for rounding them to maintain their comprehensive features. To ensure backpropagation, we utilize the “adding noise” operation during training and “rounding” during testing, as described in [1].

Nevertheless, conventional rounding is essentially quantization with a step size of “1”, which is inappropriate for the scaling $\bm{l}$ and the offset $\bm{o}$, since they usually take fractional (non-integer) values. To address this, we further introduce an Adaptive Quantization Module (AQM), which adaptively determines quantization steps. In particular, for the $i$-th anchor $\bm{x}^a_i$, we denote by $\bm{f}_i$ any of the components of $\mathcal{A}_i$: $\bm{f}_i \in \{\bm{f}^a_i, \bm{l}_i, \bm{o}_i\}$ with $\bm{f}_i \in \mathbb{R}^{D}$, where $D \in \{D^a, 6, 3K\}$ is its respective dimension. The quantization can be written as,

$$\begin{aligned}
\hat{\bm{f}}_i &= \bm{f}_i + \mathcal{U}\left(-\frac{1}{2}, \frac{1}{2}\right) \times \bm{q}_i, && \text{for training} \\
&= {\rm Round}(\bm{f}_i / \bm{q}_i) \times \bm{q}_i, && \text{for testing}
\end{aligned} \tag{4}$$

where

$$\begin{aligned}
\bm{q}_i &= Q_0 \times \left(1 + {\rm Tanh}\left(\bm{r}_i\right)\right) \\
\bm{r}_i &= {\rm MLP_q}\left(\bm{f}^h_i\right).
\end{aligned} \tag{5}$$

We use a simple MLP-based context model ${\rm MLP_q}$ to predict from hash feature $\bm{f}^h_i$ a refinement $\bm{r}_i \in \mathbb{R}^{1}$, which is used to adjust the predefined quantization step size $Q_0$. Note that $Q_0$ varies for $\bm{f}^a$, $\bm{l}$, and $\bm{o}$. Eq. (5) essentially restricts the quantization step size $\bm{q}_i \in \mathbb{R}^{1}$ to be chosen within $\left(0, 2Q_0\right)$, enabling $\hat{\bm{f}}_i$ to closely resemble the original characteristics of $\bm{f}_i$, maintaining a high rendering fidelity.
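The two branches of Eq. (4) and the step-size refinement of Eq. (5) can be sketched as follows. This is a minimal scalar-step illustration, not the authors' implementation: the refinement $\bm{r}_i$ is passed in as a plain number standing in for the output of ${\rm MLP_q}$, and the function names and the values chosen for $Q_0$ and the attributes are assumptions.

```python
import math
import random
from typing import List


def step_size(q0: float, r: float) -> float:
    """Eq. (5): q = Q0 * (1 + tanh(r)), which lies in (0, 2 * Q0)."""
    return q0 * (1.0 + math.tanh(r))


def quantize(f: List[float], q: float, training: bool, rng: random.Random) -> List[float]:
    """Eq. (4): add U(-1/2, 1/2) * q in training, snap to multiples of q in testing."""
    if training:
        # Uniform noise of width q mimics rounding error while keeping gradients usable.
        return [v + rng.uniform(-0.5, 0.5) * q for v in f]
    return [round(v / q) * q for v in f]


rng = random.Random(0)
q = step_size(q0=0.1, r=0.0)  # tanh(0) = 0, so q equals Q0
offsets = [0.33, -0.07]
print(quantize(offsets, q, training=False, rng=rng))  # multiples of q near the inputs
print(quantize(offsets, q, training=True, rng=rng))   # inputs perturbed by at most q / 2
```

With plain rounding (a step of 1), both example values would collapse to 0, whereas a step of 0.1 keeps them close to 0.3 and -0.1, which is the motivation for predicting a per-anchor step around $Q_0$.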

Gaussian Distribution Modeling. To measure the bit consumption of $\hat{\bm{f}}_i$ during training, its probability needs to be estimated in a differentiable manner. As shown in Fig. 3 left, all three components of anchor attributes $\mathcal{A}$ exhibit statistical tendencies of Gaussian distributions, where $\bm{l}$ displays a single-sided pattern due to Sigmoid activation (we define $\bm{l}$ as the one after Sigmoid activation, which is slightly different from [25]). This observation establishes a lower bound for probability prediction when all $\hat{\bm{f}}_i$s in $\mathcal{A}$ are estimated using the respective $\mu$ and $\sigma$ of the statistical Gaussian distribution of $\bm{f}^a$, $\bm{l}$ and $\bm{o}$. Nevertheless, employing a single set of $\mu$ and $\sigma$ for all attributes may lack accuracy. Therefore, we assume the values of anchor attributes $\mathcal{A}$ to be independent, and construct their respective Gaussian distributions, where their individual $\bm{\mu}$ and $\bm{\sigma}$ are estimated by a simple MLP-based context model $\mathrm{MLP_c}$ from $\bm{f}^h$.
Specifically, for the $i$-th anchor and its quantized anchor attribute vector $\hat{\bm{f}}_i$, with the estimated $\bm{\mu}_i \in \mathbb{R}^D$ and $\bm{\sigma}_i \in \mathbb{R}^D$, we can compute the probability of $\hat{\bm{f}}_i$ as,

$$
\begin{aligned}
p(\hat{\bm{f}}_i) &= \int_{\hat{\bm{f}}_i - \frac{1}{2}\bm{q}_i}^{\hat{\bm{f}}_i + \frac{1}{2}\bm{q}_i} \phi_{\bm{\mu}_i, \bm{\sigma}_i}\left(x\right)\,dx \\
&= \Phi_{\bm{\mu}_i, \bm{\sigma}_i}\left(\hat{\bm{f}}_i + \frac{1}{2}\bm{q}_i\right) - \Phi_{\bm{\mu}_i, \bm{\sigma}_i}\left(\hat{\bm{f}}_i - \frac{1}{2}\bm{q}_i\right) \\
\bm{\mu}_i, \bm{\sigma}_i &= \mathrm{MLP_c}\left(\bm{f}^h_i\right).
\end{aligned} \tag{6}
$$

where $\phi$ and $\Phi$ represent the probability density function and the cumulative distribution function, respectively. Consequently, we define an entropy loss as the summation of bit consumption over all $\hat{\bm{f}}_i$s:

$$
L_{\text{entropy}} = \sum_{\bm{f} \in \{\bm{f}^a, \bm{l}, \bm{o}\}} \sum_{i=1}^{N} \sum_{j=1}^{D} \left(-\log_2 p(\hat{f}_{i,j})\right) \tag{7}
$$

where $N$ is the number of anchors and $\hat{f}_{i,j}$ is the $j$-th dimension value of $\hat{\bm{f}}_i$. Minimizing the entropy loss will encourage a high probability estimation for $p(\hat{\bm{f}}_i)$, which in turn encourages an accurate $\bm{\mu}_i$ and a small $\bm{\sigma}_i$ and thus guides the learning of the context model.
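The probability in (6) and the bit estimate in (7) can be illustrated per scalar value with the Gaussian CDF written in terms of the error function. This is a sketch only: the helper names are hypothetical, the $\mu$, $\sigma$ and step values are hand-picked rather than predicted by $\mathrm{MLP_c}$, and the small probability clamp is an assumption added to keep the logarithm finite.

```python
import math

def gaussian_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_probability(f_hat: float, q: float, mu: float, sigma: float) -> float:
    """Probability mass of [f_hat - q/2, f_hat + q/2], following (6)."""
    return gaussian_cdf(f_hat + 0.5 * q, mu, sigma) - gaussian_cdf(f_hat - 0.5 * q, mu, sigma)

def bits(f_hat: float, q: float, mu: float, sigma: float, eps: float = 1e-9) -> float:
    """Estimated bits for one value: -log2 p, with an assumed clamp at eps."""
    return -math.log2(max(interval_probability(f_hat, q, mu, sigma), eps))

def entropy_loss(values, steps, mus, sigmas) -> float:
    """Sum of per-value bit estimates, mirroring the summation in (7)."""
    return sum(bits(v, q, m, s) for v, q, m, s in zip(values, steps, mus, sigmas))

if __name__ == "__main__":
    print(bits(0.0, 1.0, mu=0.0, sigma=1.0))  # accurate mean
    print(bits(0.0, 1.0, mu=3.0, sigma=1.0))  # poorly predicted mean costs more bits
    print(bits(0.0, 1.0, mu=0.0, sigma=0.5))  # accurate mean with a tighter sigma costs fewer
```

The three calls show the behaviour described above: an accurate mean with a small standard deviation concentrates mass on the quantization interval and lowers the estimated bit cost.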

Adaptive Offset Masking. From Fig. 3 left, we can also see that $\bm{o}$ exhibits an impulse at zero, suggesting the occurrence of substantial unnecessary Gaussians. Thus, we employ the technique introduced by Lee et al. [20] to prune invalid $\bm{o}$ by utilizing straight-through [3] estimated binary masks. Specifically, we apply the same masking loss $L_m$ in [20] to encourage masking as many Gaussians as possible. This process effectively masks out invalid offsets and saves storage space directly. Additionally, we implement anchor pruning: if all the attached $\bm{o}$ are pruned on an anchor, then this anchor no longer contributes to rendering and should be pruned entirely (including its $\bm{x}^a$ and $\mathcal{A}$).
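The anchor-pruning rule can be sketched on already-binarized masks. The data layout (one list of 0/1 entries per anchor) and the function name are assumptions for illustration; how the masks are learned with the straight-through estimator is not modeled here.

```python
def prune(offset_masks):
    """Return (keep_anchor flags, number of surviving offsets).

    offset_masks[i][k] is 1 if the k-th offset of anchor i is kept, else 0.
    An anchor is dropped only when every one of its offsets is masked out.
    """
    keep_anchor = [any(mask) for mask in offset_masks]
    kept_offsets = sum(sum(1 for m in mask if m) for mask in offset_masks)
    return keep_anchor, kept_offsets

if __name__ == "__main__":
    masks = [
        [1, 0, 0],  # one offset survives, anchor stays
        [0, 0, 0],  # all offsets masked, anchor is pruned with its location and attributes
        [1, 1, 0],
    ]
    print(prune(masks))
```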

Hash Grid Compression. As shown in (3), the size of the hash grid $\mathcal{H}$ also significantly influences the final storage size. To this end, we binarize the hash table to $\{-1, +1\}$ using straight-through estimation (STE) [35] and calculate the occurrence frequency $h_f$ [6] of the symbol “$+1$” to estimate its bit consumption:

$$
L_{\text{hash}} = M_{+} \times \left(-\log_2(h_f)\right) + M_{-} \times \left(-\log_2(1 - h_f)\right) \tag{8}
$$

where $M_{+}$ and $M_{-}$ are the total numbers of “$+1$” and “$-1$” in the hash grid.
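A small sketch of (8) on a toy binary table follows. The function name and the flat-list layout are assumptions; the guards for tables containing only one symbol are added here so the logarithm is never evaluated at zero.

```python
import math

def hash_bits(table):
    """Estimated bits for a {-1, +1} table following (8)."""
    m_plus = sum(1 for v in table if v == 1)
    m_minus = len(table) - m_plus
    h_f = m_plus / len(table)  # occurrence frequency of the symbol +1
    total = 0.0
    if m_plus:  # skip the term when +1 never occurs (assumed guard)
        total += m_plus * -math.log2(h_f)
    if m_minus:  # skip the term when -1 never occurs (assumed guard)
        total += m_minus * -math.log2(1.0 - h_f)
    return total

if __name__ == "__main__":
    print(hash_bits([1, -1] * 4))     # balanced table: one bit per entry
    print(hash_bits([1, 1, 1, -1]))   # skewed table: fewer than one bit per entry
```

A balanced table costs one bit per entry, and any skew in the symbol frequency lowers the estimate.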

3.4 Training and Coding Process

During training, we incorporate both the rendering fidelity loss and the entropy loss to ensure the model improves rendering quality while controlling total bitrate consumption in a differentiable manner. Our overall loss is

$$
Loss = L_{\text{Scaffold}} + \lambda_e \frac{1}{N(D^a + 6 + 3K)} \left(L_{\text{entropy}} + L_{\text{hash}}\right) + \lambda_m L_m. \tag{9}
$$

Here, $L_{\text{Scaffold}}$ represents the rendering loss as defined in [25], which includes two fidelity penalty loss terms and one regularization term for the scaling $\bm{l}$. The second part in (9) is the estimated controllable bit consumption, including the estimated bits $L_{\text{entropy}}$ for anchor attributes and $L_{\text{hash}}$ for the hash grid. The last term $L_m$ in (9) is the masking loss adopted from [20] to regularize the adaptive offset masking module. $\lambda_e$ and $\lambda_m$ are trade-off hyperparameters used to balance the loss components. Note that we incorporate different techniques or loss items at different iterations to stabilize the training process. Please refer to the supplementary Sec. A for more details.
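To make the normalization in (9) concrete, the sketch below combines scalar loss values. The divisor $N(D^a + 6 + 3K)$ is read here as the total number of attribute values over all anchors (presumably $D^a$ feature dimensions, 6 scaling values and $3K$ offset coordinates per anchor), which is an interpretation rather than something stated in this section. The function name and all numeric inputs, including $K = 10$, are hypothetical.

```python
def total_loss(l_scaffold, l_entropy, l_hash, l_mask,
               n_anchors, d_a, k, lambda_e, lambda_m):
    """Overall objective following (9), on plain floats."""
    norm = n_anchors * (d_a + 6 + 3 * k)  # assumed: attribute values across all anchors
    rate_term = lambda_e * (l_entropy + l_hash) / norm
    return l_scaffold + rate_term + lambda_m * l_mask

if __name__ == "__main__":
    # Made-up loss values; d_a=50, lambda_e=4e-3 and lambda_m=5e-4 follow Sec. 4.1.
    print(total_loss(l_scaffold=1.0, l_entropy=100.0, l_hash=20.0, l_mask=2.0,
                     n_anchors=2, d_a=50, k=10, lambda_e=4e-3, lambda_m=5e-4))
```

Dividing by the attribute count turns the rate term into an average bit cost per value, so $\lambda_e$ keeps a comparable meaning across scenes with very different anchor counts.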

For the encoding/decoding process, the binary hash grid $\mathcal{H}$ is first encoded/decoded using AE with $h_f$. Then, hash feature $\bm{f}^h$ is obtained through interpolation based on $\mathcal{H}$ and $\bm{x}^a$. Once $\bm{f}^h$ is acquired, the context models $\mathrm{MLP_q}$ and $\mathrm{MLP_c}$ are then employed to estimate the quantization refinement term $\bm{r}$ and the parameters of the Gaussian distribution (i.e., $\bm{\mu}$ and $\bm{\sigma}$) to derive the probability $p(\hat{\bm{f}})$ for entropy encoding/decoding with AE.

4 Experiments

In this section, we first present our HAC framework’s implementation details and then conduct evaluation experiments to compare with existing 3DGS compression approaches. Additionally, we include ablation studies to demonstrate the effectiveness of each technical component of our method. Finally, we visualize the bit allocation map for better understanding.

4.1 Implementation Details

We implement our HAC based on the Scaffold-GS repository [25] using the PyTorch framework [31] and train the model on a single NVIDIA RTX 4090 GPU. We increase the dimension of the Scaffold-GS anchor feature $\bm{f}^a$ (i.e., $D^a$) to 50, and disable its feature bank as we found it may lead to unstable training. For the hash grid $\mathcal{H}$, we utilize a mixed 3D-2D structured binary hash grid, with 12 levels of 3D embeddings ranging from 16 to 512 resolutions, and 4 levels of 2D embeddings ranging from 128 to 1024 resolutions. The maximum hash table sizes are $2^{13}$ and $2^{15}$ for the 3D and 2D grids, respectively, both with a feature dimension of $D^h = 4$. We set $\lambda_m$ to $5e{-}4$, and change $\lambda_e$ from $5e{-}4$ to $4e{-}3$ for variable bitrates. We set $Q_0$ as $1$, $0.001$ and $0.2$ for $\bm{f}^a$, $\bm{l}$ and $\bm{o}$, respectively. We combine $\mathrm{MLP_q}$ and $\mathrm{MLP_c}$ into a single 3-layer MLP with ReLU activation.

Table 1: Quantitative results of our approach and others. 3DGS [17] and Scaffold-GS [25] are two baselines. Approaches in the middle chunk are designed for 3DGS compression. For our approach, we give two results of different size and fidelity tradeoffs by adjusting $\lambda_e$. A smaller $\lambda_e$ results in a larger size but improved fidelity, and vice versa. The best and 2nd best results are highlighted in red and yellow cells. The size is measured in MB.

| Datasets | Synthetic-NeRF [26] | | | | Mip-NeRF360 [2] | | | | Tanks&Temples [18] | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Methods | PSNR↑ | SSIM↑ | LPIPS↓ | Size↓ | PSNR↑ | SSIM↑ | LPIPS↓ | Size↓ | PSNR↑ | SSIM↑ | LPIPS↓ | Size↓ |
| 3DGS [17] | 33.80 | 0.970 | 0.031 | 68.46 | 27.49 | 0.813 | 0.222 | 744.7 | 23.69 | 0.844 | 0.178 | 431.0 |
| Scaffold-GS [25] | 33.41 | 0.966 | 0.035 | 19.36 | 27.50 | 0.806 | 0.252 | 253.9 | 23.96 | 0.853 | 0.177 | 86.50 |
| Lee et al. [20] | 33.33 | 0.968 | 0.034 | 5.54 | 27.08 | 0.798 | 0.247 | 48.80 | 23.32 | 0.831 | 0.201 | 39.43 |
| Compressed3D [30] | 32.94 | 0.967 | 0.033 | 3.68 | 26.98 | 0.801 | 0.238 | 28.80 | 23.32 | 0.832 | 0.194 | 17.28 |
| EAGLES [12] | 32.54 | 0.965 | 0.039 | 5.74 | 27.15 | 0.808 | 0.238 | 68.89 | 23.41 | 0.840 | 0.200 | 34.00 |
| LightGaussian [10] | 32.73 | 0.965 | 0.037 | 7.84 | 27.00 | 0.799 | 0.249 | 44.54 | 22.83 | 0.822 | 0.242 | 22.43 |
| Morgenstern et al. [27] | 31.05 | 0.955 | 0.047 | 2.20 | 26.01 | 0.772 | 0.259 | 23.90 | 22.78 | 0.817 | 0.211 | 13.05 |
| Navaneet et al. [29] | 33.09 | 0.967 | 0.036 | 4.42 | 27.16 | 0.808 | 0.228 | 50.30 | 23.47 | 0.840 | 0.188 | 27.97 |
| Ours-lowrate | 33.24 | 0.967 | 0.037 | 1.18 | 27.53 | 0.807 | 0.238 | 15.26 | 24.04 | 0.846 | 0.187 | 8.10 |
| Ours-highrate | 33.71 | 0.968 | 0.034 | 1.86 | 27.77 | 0.811 | 0.230 | 21.87 | 24.40 | 0.853 | 0.177 | 11.24 |

| Datasets | DeepBlending [16] | | | | BungeeNeRF [40] | | | |
|---|---|---|---|---|---|---|---|---|
| Methods | PSNR↑ | SSIM↑ | LPIPS↓ | Size↓ | PSNR↑ | SSIM↑ | LPIPS↓ | Size↓ |
| 3DGS [17] | 29.42 | 0.899 | 0.247 | 663.9 | 24.87 | 0.841 | 0.205 | 1616 |
| Scaffold-GS [25] | 30.21 | 0.906 | 0.254 | 66.00 | 26.62 | 0.865 | 0.241 | 183.0 |
| Lee et al. [20] | 29.79 | 0.901 | 0.258 | 43.21 | 23.36 | 0.788 | 0.251 | 82.60 |
| Compressed3D [30] | 29.38 | 0.898 | 0.253 | 25.30 | 24.13 | 0.802 | 0.245 | 55.79 |
| EAGLES [12] | 29.91 | 0.910 | 0.250 | 62.00 | 25.24 | 0.843 | 0.221 | 117.1 |
| LightGaussian [10] | 27.01 | 0.872 | 0.308 | 33.94 | 24.52 | 0.825 | 0.255 | 87.28 |
| Morgenstern et al. [27] | 28.92 | 0.891 | 0.276 | 8.40 | -- | -- | -- | -- |
| Navaneet et al. [29] | 29.75 | 0.903 | 0.247 | 42.77 | 24.63 | 0.823 | 0.239 | 104.3 |
| Ours-lowrate | 29.98 | 0.902 | 0.269 | 4.35 | 26.48 | 0.845 | 0.250 | 18.49 |
| Ours-highrate | 30.34 | 0.906 | 0.258 | 6.35 | 27.08 | 0.872 | 0.209 | 29.72 |

4.2 Experiment Evaluation

Baselines. We compare our HAC with existing 3DGS compression approaches. Notably, [20, 30, 29, 10] mainly adopt codebook-based or parameter pruning strategies, while Scaffold-GS [25] explores Gaussian relations for compact representation. Additionally, EAGLES [12] and Morgenstern et al. [27] employ non-contextual entropy constraints and dimension collapse techniques, respectively.

Datasets. We follow Scaffold-GS to perform evaluations on multiple datasets, including the small-scale Synthetic-NeRF [26] and the four large-scale real-scene datasets: BungeeNeRF [40], DeepBlending [16], Mip-NeRF360 [2], and Tanks&Temples [18]. Note that we evaluate all 9 scenes of the Mip-NeRF360 dataset [2]. Covering diverse scenarios, these datasets allow us to comprehensively demonstrate the effectiveness of our approach.

Metrics. To comprehensively evaluate compression Rate-Distortion (RD) performance, we calculate relative rate (size) change of our approach over others under a similar fidelity. Note that BD-rate [4] is incalculable as other methods can typically only output a single rate, while four are needed for its calculation.

Figure 4: RD curves for quantitative comparisons. We vary $\lambda_e$ to achieve variable bitrates. Note that a $\log_{10}$ scale is used for the x-axis for better visualization.
Figure 5: Qualitative comparisons of “pompidou” from BungeeNeRF [40] and “flowers” from Mip-NeRF360 [2]. PSNR and size results are given at lower-left.

Results. Quantitative results are shown in Tab. 1 and Fig. 4, and the qualitative outputs are presented in Fig. 5. Please refer to the supplementary Sec. C for the detailed metrics of each scene. Our proposed HAC achieves a significant size reduction of over $75\times$ compared to the vanilla 3DGS [17], with even improved fidelity. The size reduction also exceeds $11\times$ over the base model Scaffold-GS [25]. Notably, our highest fidelity surpasses Scaffold-GS, primarily due to two factors: 1) the entropy loss effectively regularizes the model to prevent overfitting, and 2) we increase the dimension of the anchor feature (i.e., $D^a$) to 50, resulting in a larger model volume. Although other compression approaches (mid chunk) can reduce the model size to some extent by primarily using pruning and codebooks, they still exhibit significant spatial redundancy. Specifically, Morgenstern et al. [27] achieve a comparably small size, but significantly sacrifice fidelity due to the dimension collapsing. Note that the relative size changes must be measured under similar fidelity quality.
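As a rough illustration of how such size ratios can be read off Tab. 1, the sketch below divides baseline sizes by the Ours-lowrate sizes per dataset and averages them. The text does not spell out how the headline ratios are aggregated, so the plain mean over datasets is an assumption, and the low-rate point only approximately matches the baselines' fidelity. The dictionary layout and function names are hypothetical; the numbers are copied from Tab. 1.

```python
SIZES_MB = {
    "Synthetic-NeRF": {"3DGS": 68.46, "Scaffold-GS": 19.36, "Ours-lowrate": 1.18},
    "Mip-NeRF360":    {"3DGS": 744.7, "Scaffold-GS": 253.9, "Ours-lowrate": 15.26},
    "Tanks&Temples":  {"3DGS": 431.0, "Scaffold-GS": 86.50, "Ours-lowrate": 8.10},
    "DeepBlending":   {"3DGS": 663.9, "Scaffold-GS": 66.00, "Ours-lowrate": 4.35},
    "BungeeNeRF":     {"3DGS": 1616.0, "Scaffold-GS": 183.0, "Ours-lowrate": 18.49},
}

def size_reduction(baseline, ours="Ours-lowrate"):
    """Per-dataset ratio of the baseline size to our size."""
    return {name: row[baseline] / row[ours] for name, row in SIZES_MB.items()}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

if __name__ == "__main__":
    for baseline in ("3DGS", "Scaffold-GS"):
        ratios = size_reduction(baseline)
        for name, ratio in ratios.items():
            print(f"{baseline:12s} {name:15s} {ratio:7.1f}x")
        print(f"{baseline:12s} mean over datasets: {mean(ratios.values()):.1f}x")
```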

Bitstream. Our bitstream consists of five components: anchor attributes $\mathcal{A}$ (comprising $\bm{f}^a$, $\bm{l}$ and $\bm{o}$), binary hash grid $\mathcal{H}$, offset masks, anchor locations $\bm{x}^a$ and MLPs. Among them, $\mathcal{A}$ is encoded using entropy codec AE [39] with estimated probabilities from HAC. It accounts for the dominant portion of the storage. The hash grid $\mathcal{H}$ and the masks are binary data and are encoded by AE using the respective occurrence frequency. The last two components are stored directly in 16 and 32 bits, respectively. On the most challenging BungeeNeRF dataset [40] with $\lambda_e = 4e{-}3$, the bit allocation of these five components is 14.90MB (8.76MB, 2.52MB, 3.62MB), 0.15MB, 0.52MB, 2.77MB, and 0.16MB, respectively. As scenes become simpler, the storage share of $\mathcal{A}$ decreases as the value distribution becomes easier to predict.
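The reported breakdown can be checked with a few lines of arithmetic. The component sizes below are the BungeeNeRF figures quoted above; the dictionary keys are just labels chosen for this sketch.

```python
COMPONENTS_MB = {
    "anchor attributes A": 14.90,
    "binary hash grid H": 0.15,
    "offset masks": 0.52,
    "anchor locations x^a": 2.77,
    "MLPs": 0.16,
}
ATTRIBUTE_PARTS_MB = {"f^a": 8.76, "l": 2.52, "o": 3.62}

total_mb = sum(COMPONENTS_MB.values())
shares = {name: size / total_mb for name, size in COMPONENTS_MB.items()}

if __name__ == "__main__":
    print(f"attribute parts sum: {sum(ATTRIBUTE_PARTS_MB.values()):.2f} MB")
    print(f"total bitstream:     {total_mb:.2f} MB")
    for name, share in shares.items():
        print(f"{name:22s} {share:6.1%}")
```

The components add up to about 18.5 MB, in line with the 18.49 MB listed for Ours-lowrate on BungeeNeRF in Tab. 1, and the anchor attributes take roughly four fifths of the total.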

Figure 6: Ablations of different components in HAC. We vary $\lambda_e$ for variable rates.

4.3 Ablation Study

In this subsection, we conduct ablation studies to demonstrate the effectiveness of each technical component. We conduct experiments on both the most challenging large-scale BungeeNeRF dataset [40] and the small-scale Synthetic-NeRF dataset [26] to support convincing and solid results. We assess the effectiveness of individual technical components by disabling one of the following at a time: 1) mutual information from the hash grid, 2) the adaptive quantization module (AQM), 3) adaptive offset masking. The results are presented in Fig. 6. Firstly, we set the hash grid to all zeros to eliminate mutual information. This leads to a degradation of the conditional probability from $p(\mathcal{A}|\bm{f}^h)$ to $p(\mathcal{A})$, which indicates that the probability of $\hat{\bm{f}}$ can only be estimated by the statistical $\mu$ and $\sigma$ from the left part of Fig. 3. Consequently, the bit consumption drastically increases as the probability can no longer be accurately estimated. Regarding the latter two components, they contribute from different perspectives. Disabling AQM (we remove $\bm{r}$ while retaining $Q_0$ to ensure a necessary decimal quantization step) results in a significant drop in fidelity, especially in more complex scenes or at higher rates, as $\hat{\bm{f}}$ fails to retain sufficient information for rendering after quantization. In contrast, offset masking can achieve remarkable rate savings in simpler scenes or lower rate segments due to more significant positional redundancy in Gaussians. Overall, all three components provide a worthwhile tradeoff for improved RD performance.

Figure 7: Visualization of bit allocation maps for the scenes “materials” and “drums” on Synthetic-NeRF dataset [26]. The 3D space is voxelized, with each voxel represented by a ball and the radius of a ball indicating the number of anchors in the voxel. For the 2nd column, the color of a ball indicates the total bit consumption of all anchors in the voxel, while for the 4th column, the color represents the averaged bit consumption per anchor within a voxel. The 3rd column gives zoom-in views. It shows more anchors are allocated to important regions while the bit consumption for each anchor is smooth.

4.4 Visualization of Bit Allocation

While HAC measures the parameters’ bit consumption, we are interested in the bit allocation across different local areas in the space. We utilize the Synthetic-NeRF dataset [26] for observation, as its object-centered scenes are well-suited for visualization. As shown in Fig. 7, the bit allocation is represented by voxelized colored balls. As observed from the 2nd column of visualized sub-figures, the model tends to allocate more total bits to areas with complex appearances or sharp edges. For instance, specular objects in “materials” and instrument stands in “drums” exhibit higher total bit consumption due to the complexity of textures for those regions. The analysis of the 4th column from an averaging viewpoint reveals varied trends in bit consumption per anchor. In high bit-consumption voxels, creating more anchors for precise modeling can average the bit rate per anchor, smoothing or reducing their bit consumption. This phenomenon aligns with our assumption that anchors demonstrate inherent consistency in the 3D space where nearby anchors exhibit similar values of attributes, making it easier for the hash grid to accurately estimate their value distribution probabilities.
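The two quantities visualized in Fig. 7 (total bits per voxel and average bits per anchor within a voxel) can be sketched as a simple grouping step. The function name, the uniform voxel size and the toy inputs are assumptions; the actual voxelization used for the figure is not described in this section.

```python
import math
from collections import defaultdict

def voxel_bit_allocation(positions, bits_per_anchor, voxel_size):
    """Map each voxel index to (anchor count, total bits, mean bits per anchor)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (x, y, z), anchor_bits in zip(positions, bits_per_anchor):
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        totals[key] += anchor_bits
        counts[key] += 1
    return {key: (counts[key], totals[key], totals[key] / counts[key]) for key in totals}

if __name__ == "__main__":
    # Toy anchors: two share a voxel, one sits alone in a neighbouring voxel.
    anchors = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.5, 0.0, 0.0)]
    bits = [100.0, 300.0, 50.0]
    for voxel, stats in voxel_bit_allocation(anchors, bits, voxel_size=1.0).items():
        print(voxel, stats)
```

In the figure's terms, the anchor count would drive the ball radius, the total would color the 2nd column and the mean would color the 4th column.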

4.5 Training and Execution Time

Training time. Integration of additional models in HAC results in an increase in training time, approximately $0.9\times$ longer than Scaffold-GS. Specifically, for the challenging city-scale BungeeNeRF dataset [40], the training times are 38.2 minutes for 3DGS [17], 15.1 minutes for Scaffold-GS [25] and 27.6 minutes for our model. For the small-scale Synthetic-NeRF dataset [26], training times are 3.4 minutes, 4.4 minutes and 9.0 minutes, respectively. This increase in training time over Scaffold-GS is our main limitation, but training remains fast.

Coding time. The encoding/decoding process takes approximately 0.87 seconds and 26.7 seconds on the Synthetic-NeRF and BungeeNeRF datasets under $\lambda_e = 4e{-}3$, respectively. The dominant time consumption occurs during codec execution of AE on the CPU (over $90\%$), as we only use a single thread.

Inference time. The inference process benefits from the design of the context modeling, allowing for the removal of the hash grid once $\mathcal{A}$ is fully decoded. Consequently, no additional operations are required during rendering, resulting in an FPS similar to that of Scaffold-GS.

5 Conclusion

In this paper, we have pioneered an investigation into the relationship between unorganized and sparse Gaussians (or anchors in our paper) and well-structured hash grids, leveraging their mutual information for compact 3DGS representations. Our proposed Hash-grid Assisted Context (HAC) framework has achieved SOTA compression performance with a remarkable lead over concurrent works. Extensive experiments have demonstrated the effectiveness of our HAC and its technical components. Overall, our work has successfully mitigated the major challenge of the 3DGS model, i.e., its large storage requirement, enabling its adoption in large-scale scenes and on diverse devices.

References

  • [1] Ballé, J., Minnen, D., Singh, S., Hwang, S.J., Johnston, N.: Variational image compression with a scale hyperprior. In: International Conference on Learning Representations (2018)
  • [2] Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5470–5479 (2022)
  • [3] Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)
  • [4] Bjontegaard, G.: Calculation of average psnr differences between rd-curves. ITU SG16 Doc. VCEG-M33 (2001)
  • [5] Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: Tensorial radiance fields. In: European Conference on Computer Vision. pp. 333–350. Springer (2022)
  • [6] Chen, Y., Wu, Q., Harandi, M., Cai, J.: How far can we compress instant-ngp-based nerf? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  • [7] Cheng, Z., Sun, H., Takeuchi, M., Katto, J.: Learned image compression with discretized gaussian mixture likelihoods and attention modules. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7939–7948 (2020)
  • [8] Cover, T.M.: Elements of information theory. John Wiley & Sons (1999)
  • [9] Deng, C.L., Tartaglione, E.: Compressing explicit voxel grid representations: fast nerfs become also small. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1236–1245 (2023)
  • [10] Fan, Z., Wang, K., Wen, K., Zhu, Z., Xu, D., Wang, Z.: Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ fps. arXiv preprint arXiv:2311.17245 (2023)
  • [11] Fridovich-Keil, S., Meanti, G., Warburg, F.R., Recht, B., Kanazawa, A.: K-planes: Explicit radiance fields in space, time, and appearance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12479–12488 (2023)
  • [12] Girish, S., Gupta, K., Shrivastava, A.: Eagles: Efficient accelerated 3d gaussians with lightweight encodings. arXiv preprint arXiv:2312.04564 (2023)
  • [13] Girish, S., Shrivastava, A., Gupta, K.: Shacira: Scalable hash-grid compression for implicit neural representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17513–17524 (2023)
  • [14] He, D., Yang, Z., Peng, W., Ma, R., Qin, H., Wang, Y.: Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5718–5727 (2022)
  • [15] He, D., Zheng, Y., Sun, B., Wang, Y., Qin, H.: Checkerboard context model for efficient learned image compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14771–14780 (2021)
  • [16] Hedman, P., Philip, J., Price, T., Frahm, J.M., Drettakis, G., Brostow, G.: Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (ToG) 37(6), 1–15 (2018)
  • [17] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023)
  • [18] Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG) 36(4), 1–13 (2017)
  • [19] Lassner, C., Zollhofer, M.: Pulsar: Efficient sphere-based neural rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1440–1449 (2021)
  • [20] Lee, J.C., Rho, D., Sun, X., Ko, J.H., Park, E.: Compact 3d gaussian representation for radiance field. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  • [21] Li, J., Li, B., Lu, Y.: Deep contextual video compression. Advances in Neural Information Processing Systems 34, 18114–18125 (2021)
  • [22] Li, J., Li, B., Lu, Y.: Hybrid spatial-temporal entropy modelling for neural video compression. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 1503–1511 (2022)
  • [23] Li, L., Shen, Z., Wang, Z., Shen, L., Bo, L.: Compressing volumetric radiance fields to 1 mb. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4222–4231 (2023)
  • [24] Li, L., Wang, Z., Shen, Z., Shen, L., Tan, P.: Compact real-time radiance fields with neural codebook. In: ICME (2023)
  • [25] Lu, T., Yu, M., Xu, L., Xiangli, Y., Wang, L., Lin, D., Dai, B.: Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  • [26] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
  • [27] Morgenstern, W., Barthel, F., Hilsmann, A., Eisert, P.: Compact 3d scene representation via self-organizing gaussian grids. arXiv preprint arXiv:2312.13299 (2023)
  • [28] Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG) 41(4), 1–15 (2022)
  • [29] Navaneet, K., Meibodi, K.P., Koohpayegani, S.A., Pirsiavash, H.: Compact3d: Compressing gaussian splat radiance field models with vector quantization. arXiv preprint arXiv:2311.18159 (2023)
  • [30] Niedermayr, S., Stumpfegger, J., Westermann, R.: Compressed 3d gaussian splatting for accelerated novel view synthesis. arXiv preprint arXiv:2401.02436 (2023)
  • [31] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)
  • [32] Rho, D., Lee, B., Nam, S., Lee, J.C., Ko, J.H., Park, E.: Masked wavelet representation for compact neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20680–20690 (2023)
  • [33] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
  • [34] Sheng, X., Li, J., Li, B., Li, L., Liu, D., Lu, Y.: Temporal context mining for learned video compression. IEEE Transactions on Multimedia (2022)
  • [35] Shin, S., Park, J.: Binary radiance fields. Advances in neural information processing systems (2023)
  • [36] Song, Z., Duan, W., Zhang, Y., Wang, S., Ma, S., Gao, W.: Spc-nerf: Spatial predictive compression for voxel based radiance field. arXiv preprint arXiv:2402.16366 (2024)
  • [37] Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5459–5469 (2022)
  • [38] Tang, J., Chen, X., Wang, J., Zeng, G.: Compressible-composable nerf via rank-residual decomposition. Advances in Neural Information Processing Systems 35, 14798–14809 (2022)
  • [39] Witten, I.H., Neal, R.M., Cleary, J.G.: Arithmetic coding for data compression. Communications of the ACM 30(6), 520–540 (1987)
  • [40] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B., Lin, D.: Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering. In: European conference on computer vision. pp. 106–122. Springer (2022)

–Supplementary Material–

Figure A: Detailed training process of our HAC model. We use the blue box to indicate the training process related to our model, while using the pink box for Scaffold-GS [25].

A More Implementation Details

A.1 Training Process

We provide a detailed overview of the training process for our HAC framework, as illustrated in Fig. A.

During the initial 3000 iterations, no additional techniques are applied to impact the original training process of Scaffold-GS [25], ensuring a stable start of the anchor attribute training and anchor spawning.

From iteration 3000 to 10000, we introduce “adding noise” operations to the anchor attributes $\mathcal{A}$, which allows the model to adapt to the quantization process. In this stage, we only apply $Q_0$ for quantization without using $\bm{r}$ for refinement, so the hash grid is not needed. We pause the anchor spawning process between iterations 3000 and 4000 as a transitional period, since the sudden introduction of quantization may destabilize spawning. Once the parameters have adapted to the quantization after iteration 4000, we re-enable the spawning process. We deliberately leave the hash grid out of this stage (i.e., before iteration 10000) so that the anchor attributes and the spawning process can first adapt to the quantization operation, which makes training more stable when the hash grid is incorporated in later iterations.
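To illustrate the “adding noise” operation, the following minimal sketch contrasts a training-time noise proxy with hard rounding. The section does not specify the noise distribution, so the uniform noise in $[-Q_0/2, Q_0/2]$ used here is an assumption borrowed from common practice in learned compression; the function names, attribute values, and step size are hypothetical.

```python
import random

def add_quantization_noise(values, q0, rng):
    """Training-time proxy (assumed): perturb each value by uniform noise
    in [-q0/2, q0/2] instead of rounding, so gradients remain usable."""
    return [v + rng.uniform(-0.5, 0.5) * q0 for v in values]

def quantize(values, q0):
    """Hard quantization: round each value to the nearest multiple of q0."""
    return [round(v / q0) * q0 for v in values]

rng = random.Random(0)
attrs = [0.13, -0.42, 0.91]  # hypothetical anchor attribute values
q0 = 0.1                     # hypothetical base quantization step Q_0

noisy = add_quantization_noise(attrs, q0, rng)
hard = quantize(attrs, q0)

# Both operations move a value by at most half a quantization step.
max_noise_shift = max(abs(n - a) for n, a in zip(noisy, attrs))
max_round_shift = max(abs(h - a) for h, a in zip(hard, attrs))
print(max_noise_shift <= q0 / 2, max_round_shift <= q0 / 2 + 1e-12)
```

Under this assumption, the perturbation seen during training has the same magnitude bound as the rounding error, which is why such a proxy lets the attributes adapt to quantization before it is applied exactly.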

After iteration 10000, assuming the 3D model has adequately adapted to the quantization, we fully integrate our HAC framework and jointly train the binary hash grid. Notably, the bounds of the hash grid are determined by the maximum and minimum anchor locations at the 10000th iteration, which are then used to normalize anchor locations for interpolation in the hash grid. This pipeline ensures a stable training process that reduces the model size via entropy constraints while maintaining high fidelity.
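The schedule and the bound-based normalization described in this subsection can be sketched as follows. The iteration thresholds come from the text; the stage names, the treatment of boundary iterations (e.g., whether iteration 3000 itself belongs to the second stage), and the $[0, 1]$ target range of the normalization are assumptions made for illustration.

```python
NOISE_START = 3000    # "adding noise" begins
SPAWN_RESUME = 4000   # anchor spawning is re-enabled
HASH_START = 10000    # hash grid is trained jointly (full HAC)

def training_stage(iteration):
    """Map an iteration to a (hypothetically named) stage of the schedule."""
    if iteration < NOISE_START:
        return "scaffold_only"
    if iteration < SPAWN_RESUME:
        return "noise_spawning_paused"
    if iteration < HASH_START:
        return "noise_spawning_on"
    return "full_hac"

def anchor_bounds(anchors):
    """Per-axis min and max of anchor locations, taken once at HASH_START."""
    lo = [min(p[d] for p in anchors) for d in range(3)]
    hi = [max(p[d] for p in anchors) for d in range(3)]
    return lo, hi

def normalize(point, lo, hi, eps=1e-12):
    """Map a location into [0, 1]^3 using the frozen bounds (range assumed)."""
    return [(point[d] - lo[d]) / max(hi[d] - lo[d], eps) for d in range(3)]

anchors = [(0.0, 0.0, 0.0), (2.0, 4.0, 8.0), (1.0, 1.0, 1.0)]  # toy data
lo, hi = anchor_bounds(anchors)
unit = normalize(anchors[2], lo, hi)
print(training_stage(3500), unit)
```

Because the bounds are fixed at the 10000th iteration, anchors spawned later may fall slightly outside the normalized range; how that case is handled is not stated here, so the sketch leaves such values unclamped.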

A.2 Sampling Strategy

During training, using all anchors for entropy training in each iteration could prolong training and cause out-of-memory (OOM) issues. Therefore, in each iteration we randomly sample only 5% of the anchors used for rendering and apply entropy training to them. This speeds up training while preserving satisfactory RD performance.
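A minimal sketch of this sampling step is given below. The 5% ratio is taken from the text; the function name, the use of anchor indices, and the choice of sampling without replacement are assumptions.

```python
import random

def sample_for_entropy(visible_ids, ratio=0.05, rng=None):
    """Pick a random subset of the anchors used for rendering.

    Only the returned anchors would contribute to the entropy loss in
    the current iteration; all visible anchors are still rendered.
    """
    if not visible_ids:
        return []
    rng = rng or random.Random()
    k = max(1, round(len(visible_ids) * ratio))
    return rng.sample(visible_ids, k)  # without replacement (assumed)

visible = list(range(1000))  # hypothetical indices of anchors used for rendering
subset = sample_for_entropy(visible, rng=random.Random(0))
print(len(subset))
```

Drawing a fresh subset every iteration means that, over many iterations, each anchor is still expected to receive entropy supervision, while the per-iteration cost of the context model scales with the subset size only.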

B Additional Experiments

Table A: Bit allocation among anchor’s attributes. We set $\lambda_e = 4 \times 10^{-3}$. When calculating per-param size, we only consider valid anchors that are not pruned.

| Dataset | $\bm{f}^a$ total size (MB) | $\bm{l}$ total size (MB) | $\bm{o}$ total size (MB) | $\bm{f}^a$ per-param size (bit) | $\bm{l}$ per-param size (bit) | $\bm{o}$ per-param size (bit) |
|---|---|---|---|---|---|---|
| Bungee-NeRF [40] | 8.76 | 2.53 | 3.62 | 3.03 | 7.27 | 4.56 |
| Synthetic-NeRF [26] | 0.31 | 0.09 | 0.12 | 1.33 | 3.58 | 3.76 |

We investigate bit allocation among the anchor’s three attributes, as shown in Table A. In terms of total size, the feature $\bm{f}^a$ contributes the most because it has the highest dimensionality. However, since it is fed into MLPs to extract Gaussian attributes, it exhibits the most significant dimensional redundancy, resulting in the smallest per-parameter bit consumption. This is not the case for the scaling $\bm{l}$ and offsets $\bm{o}$, which are used directly for rendering and therefore carry much less dimensional redundancy. Additionally, as $\bm{l}$ and $\bm{o}$ are always of higher decimal precision, their value distributions are more difficult to predict accurately, resulting in higher per-parameter bit consumption.
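The relation between the two halves of Table A can be checked with a few lines of arithmetic. The numbers below are copied from Table A; the helper name and the convention 1 MB = $2^{20}$ bytes are assumptions, so the implied parameter counts are indicative only.

```python
# Values copied from Table A (lambda_e = 4e-3).
TABLE_A = {
    "Bungee-NeRF": {
        "total_mb": {"f_a": 8.76, "l": 2.53, "o": 3.62},
        "bits_per_param": {"f_a": 3.03, "l": 7.27, "o": 4.56},
    },
    "Synthetic-NeRF": {
        "total_mb": {"f_a": 0.31, "l": 0.09, "o": 0.12},
        "bits_per_param": {"f_a": 1.33, "l": 3.58, "o": 3.76},
    },
}

def implied_param_count(size_mb, bits_per_param, bytes_per_mb=2**20):
    """Number of stored parameters implied by a size and a per-param cost.
    The MB convention (2**20 bytes) is an assumption."""
    return size_mb * bytes_per_mb * 8 / bits_per_param

summary = {}
for name, row in TABLE_A.items():
    largest_total = max(row["total_mb"], key=row["total_mb"].get)
    cheapest = min(row["bits_per_param"], key=row["bits_per_param"].get)
    counts = {
        k: implied_param_count(row["total_mb"][k], row["bits_per_param"][k])
        for k in row["total_mb"]
    }
    summary[name] = (largest_total, cheapest, counts)
    print(name, largest_total, cheapest)
```

For both datasets, the attribute with the largest total size is also the one with the smallest per-parameter cost, which matches the observation that $\bm{f}^a$ dominates through its dimensionality rather than through expensive individual parameters.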

C Quantitative Results of Each Scene

C.1 Detailed Results of Our HAC Framework

Detailed per-scene results of Synthetic-NeRF dataset [26] are given in Tab. B.

Detailed per-scene results of Mip-NeRF360 dataset [2] are given in Tab. C.

Detailed per-scene results of Tank&Temples dataset [18] are given in Tab. D.

Detailed per-scene results of DeepBlending dataset [16] are given in Tab. E.

Detailed per-scene results of BungeeNeRF dataset [40] are given in Tab. F.

C.2 Detailed Results of the Base Models

We also give detailed per-scene results of all datasets of our two base models 3DGS [17] and Scaffold-GS [25] in Tab. G and Tab. H, respectively.
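As a reading aid for the per-scene tables, the size reduction of our approach relative to the two base models can be computed from the AVG rows. The sketch below uses the AVG sizes at $\lambda_e = 0.004$ from Tabs. B–F and the AVG sizes from Tabs. G and H; the unweighted mean across datasets is our own simple aggregation for illustration and is not claimed to be the aggregation used elsewhere in the paper.

```python
# AVG sizes copied from Tabs. B-F (ours, lambda_e = 0.004), Tab. G and Tab. H.
HAC = {"Synthetic-NeRF": 0.96, "Mip-NeRF360": 15.26, "Tank&Temples": 8.10,
       "DeepBlending": 4.35, "BungeeNeRF": 18.49}
GS3D = {"Synthetic-NeRF": 68.46, "Mip-NeRF360": 744.71, "Tank&Temples": 430.95,
        "DeepBlending": 663.92, "BungeeNeRF": 1616.23}
SCAFFOLD = {"Synthetic-NeRF": 19.36, "Mip-NeRF360": 253.89, "Tank&Temples": 86.50,
            "DeepBlending": 66.00, "BungeeNeRF": 183.00}

def reduction(base, ours):
    """Per-dataset size ratio base / ours."""
    return {name: base[name] / ours[name] for name in ours}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

vs_3dgs = reduction(GS3D, HAC)
vs_scaffold = reduction(SCAFFOLD, HAC)

for name in HAC:
    print(f"{name}: {vs_3dgs[name]:.1f}x vs 3DGS, "
          f"{vs_scaffold[name]:.1f}x vs Scaffold-GS")
print(f"mean: {mean(vs_3dgs.values()):.1f}x, {mean(vs_scaffold.values()):.1f}x")
```

The same helper can be applied to other $\lambda_e$ settings by swapping in the corresponding AVG rows.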

D Notation Table

Please refer to Tab. I for detailed notation explanations.

Table B: Per-scene results of Synthetic-NeRF dataset [26] of our approach.
$\lambda_e$ Scenes PSNR↑ SSIM↑ LPIPS↓ SIZE↓
0.004 chair 34.02 0.981 0.018 0.82
drums 26.20 0.950 0.044 1.23
ficus 34.27 0.983 0.016 0.71
hotdog 36.44 0.979 0.033 0.51
lego 34.25 0.976 0.027 0.97
materials 30.20 0.959 0.045 1.07
mic 35.39 0.989 0.011 0.55
ship 31.24 0.902 0.124 1.82
AVG 32.75 0.965 0.040 0.96
0.003 chair 34.33 0.982 0.017 0.89
drums 26.26 0.951 0.043 1.42
ficus 34.57 0.984 0.015 0.82
hotdog 36.70 0.980 0.031 0.54
lego 34.65 0.977 0.024 1.07
materials 30.29 0.960 0.043 1.20
mic 35.62 0.990 0.010 0.62
ship 31.32 0.903 0.121 1.86
AVG 32.97 0.966 0.038 1.05
0.002 chair 34.73 0.984 0.016 1.03
drums 26.32 0.952 0.043 1.45
ficus 34.90 0.985 0.014 0.94
hotdog 37.11 0.981 0.029 0.64
lego 35.04 0.979 0.022 1.25
materials 30.53 0.961 0.041 1.45
mic 35.92 0.990 0.010 0.67
ship 31.38 0.903 0.119 1.99
AVG 33.24 0.967 0.037 1.18
0.001 chair 35.21 0.985 0.014 1.32
drums 26.38 0.952 0.041 1.95
ficus 35.37 0.986 0.013 1.20
hotdog 37.47 0.983 0.026 0.79
lego 35.51 0.981 0.019 1.61
materials 30.58 0.961 0.040 1.62
mic 36.25 0.991 0.009 0.81
ship 31.48 0.904 0.116 2.50
AVG 33.53 0.968 0.035 1.47
0.0005 chair 35.49 0.986 0.013 1.67
drums 26.45 0.952 0.041 2.32
ficus 35.30 0.986 0.013 1.53
hotdog 37.87 0.984 0.024 0.97
lego 35.67 0.981 0.019 1.90
materials 30.70 0.962 0.039 2.07
mic 36.71 0.992 0.008 1.01
ship 31.52 0.904 0.115 3.39
AVG 33.71 0.968 0.034 1.86
Table C: Per-scene results of Mip-NeRF360 dataset [2] of our approach.
$\lambda_e$ Scenes PSNR↑ SSIM↑ LPIPS↓ SIZE↓
0.004 bicycle 25.05 0.742 0.264 27.54
garden 27.28 0.842 0.151 22.69
stump 26.58 0.762 0.269 18.11
room 31.55 0.921 0.208 5.53
counter 29.35 0.911 0.195 7.26
kitchen 31.16 0.923 0.131 8.05
bonsai 32.28 0.942 0.189 8.56
flower 21.26 0.572 0.381 19.59
treehill 23.30 0.645 0.356 20.04
AVG 27.53 0.807 0.238 15.26
0.003 bicycle 25.05 0.742 0.261 30.02
garden 27.36 0.844 0.148 24.62
stump 26.64 0.763 0.265 19.85
room 31.71 0.922 0.206 5.72
counter 29.54 0.913 0.191 7.93
kitchen 31.22 0.925 0.128 8.84
bonsai 32.50 0.944 0.186 9.40
flower 21.26 0.571 0.383 20.67
treehill 23.26 0.645 0.356 22.08
AVG 27.62 0.808 0.236 16.57
0.002 bicycle 25.10 0.742 0.262 33.14
garden 27.43 0.847 0.143 27.52
stump 26.59 0.761 0.268 21.75
room 31.87 0.925 0.201 6.47
counter 29.65 0.915 0.189 8.88
kitchen 31.46 0.928 0.125 10.05
bonsai 32.70 0.945 0.184 10.51
flower 21.32 0.576 0.377 23.73
treehill 23.34 0.647 0.350 24.83
AVG 27.72 0.809 0.233 18.54
0.001 bicycle 25.11 0.742 0.259 39.15
garden 27.46 0.849 0.139 32.17
stump 26.59 0.763 0.264 25.26
room 31.90 0.926 0.198 7.85
counter 29.74 0.918 0.184 10.44
kitchen 31.63 0.930 0.122 12.07
bonsai 32.97 0.948 0.180 12.72
flower 21.27 0.575 0.377 27.55
treehill 23.26 0.648 0.345 29.65
AVG 27.77 0.811 0.230 21.87
0.0005 bicycle 25.05 0.742 0.258 44.01
garden 27.50 0.850 0.139 36.27
stump 26.57 0.762 0.264 28.93
room 32.19 0.929 0.194 9.16
counter 29.75 0.918 0.185 12.22
kitchen 31.81 0.931 0.120 13.96
bonsai 33.16 0.949 0.178 14.90
flower 21.28 0.575 0.376 31.24
treehill 23.22 0.646 0.346 34.42
AVG 27.83 0.811 0.229 25.01
Table D: Per-scene results of Tank&Temples dataset [18] of our approach.
$\lambda_e$ Scenes PSNR↑ SSIM↑ LPIPS↓ SIZE↓
0.004 truck 25.88 0.878 0.158 9.26
train 22.19 0.815 0.216 6.94
AVG 24.04 0.846 0.187 8.10
0.003 truck 25.99 0.880 0.153 9.80
train 22.49 0.817 0.213 7.59
AVG 24.24 0.849 0.183 8.70
0.002 truck 25.99 0.881 0.153 11.15
train 22.66 0.819 0.210 8.64
AVG 24.33 0.850 0.181 9.90
0.001 truck 26.02 0.883 0.147 12.42
train 22.78 0.823 0.207 10.07
AVG 24.40 0.853 0.177 11.24
0.0005 truck 26.00 0.883 0.146 15.12
train 22.49 0.823 0.206 11.19
AVG 24.25 0.853 0.176 13.16
Table E: Per-scene results of DeepBlending dataset [16] of our approach.
$\lambda_e$ Scenes PSNR↑ SSIM↑ LPIPS↓ SIZE↓
0.004 playroom 30.44 0.902 0.272 3.15
drjohnson 29.53 0.903 0.265 5.55
AVG 29.98 0.902 0.269 4.35
0.003 playroom 30.61 0.903 0.269 3.66
drjohnson 29.67 0.904 0.261 5.71
AVG 30.14 0.903 0.265 4.69
0.002 playroom 30.66 0.905 0.265 4.12
drjohnson 29.69 0.905 0.258 6.51
AVG 30.17 0.905 0.262 5.32
0.001 playroom 30.84 0.906 0.262 5.03
drjohnson 29.85 0.906 0.255 7.67
AVG 30.34 0.906 0.258 6.35
0.0005 playroom 30.66 0.906 0.259 6.08
drjohnson 29.76 0.906 0.255 9.09
AVG 30.21 0.906 0.257 7.58
Table F: Per-scene results of BungeeNeRF dataset [40] of our approach.
$\lambda_e$ Scenes PSNR↑ SSIM↑ LPIPS↓ SIZE↓
0.004 amsterdam 26.80 0.865 0.224 22.49
bilbao 27.65 0.864 0.231 17.14
hollywood 24.25 0.748 0.347 16.55
pompidou 25.16 0.829 0.266 20.40
quebec 29.33 0.918 0.192 15.06
rome 25.68 0.845 0.243 19.30
AVG 26.48 0.845 0.250 18.49
0.003 amsterdam 26.95 0.873 0.214 24.41
bilbao 27.82 0.872 0.218 18.76
hollywood 24.27 0.753 0.342 17.87
pompidou 25.34 0.837 0.255 22.49
quebec 29.67 0.924 0.185 16.15
rome 25.98 0.855 0.231 20.83
AVG 26.67 0.852 0.241 20.08
0.002 amsterdam 27.13 0.880 0.202 27.14
bilbao 28.02 0.880 0.205 20.91
hollywood 24.43 0.763 0.330 20.09
pompidou 25.27 0.842 0.249 24.85
quebec 29.98 0.929 0.175 17.90
rome 26.28 0.866 0.219 23.07
AVG 26.85 0.860 0.230 22.33
0.001 amsterdam 27.25 0.886 0.190 31.84
bilbao 27.98 0.886 0.190 24.38
hollywood 24.59 0.772 0.319 23.41
pompidou 25.58 0.851 0.236 29.19
quebec 30.30 0.934 0.163 21.23
rome 26.61 0.876 0.203 26.91
AVG 27.05 0.868 0.217 26.16
0.0005 amsterdam 27.24 0.891 0.180 36.31
bilbao 28.09 0.891 0.181 27.72
hollywood 24.60 0.778 0.313 26.00
pompidou 25.60 0.853 0.231 33.55
quebec 30.19 0.936 0.155 24.56
rome 26.73 0.881 0.194 30.21
AVG 27.08 0.872 0.209 29.72
Table G: Per-scene results of all evaluated datasets of 3DGS [17].
Datasets Scenes PSNR↑ SSIM↑ LPIPS↓ SIZE↓
Synthetic-NeRF chair 35.65 0.988 0.010 115.77
drums 26.28 0.955 0.037 92.87
ficus 35.48 0.987 0.012 64.53
hotdog 38.05 0.985 0.020 43.37
lego 35.98 0.982 0.017 80.53
materials 30.48 0.960 0.037 38.50
mic 36.76 0.993 0.006 48.31
ship 31.73 0.907 0.107 63.82
AVG 33.80 0.970 0.031 68.46
Mip-NeRF360 bicycle 25.11 0.746 0.245 1336.45
garden 27.30 0.856 0.122 1327.99
stump 26.66 0.770 0.242 1070.92
room 31.74 0.926 0.197 353.10
counter 29.07 0.914 0.184 277.39
kitchen 31.47 0.931 0.117 414.33
bonsai 32.12 0.946 0.181 295.33
flower 21.36 0.588 0.360 814.24
treehill 22.62 0.636 0.347 812.63
AVG 27.49 0.813 0.222 744.71
Tank&Temples truck 25.38 0.877 0.148 606.99
train 22.00 0.811 0.208 254.91
AVG 23.69 0.844 0.178 430.95
DeepBlending playroom 29.83 0.900 0.247 551.93
drjohnson 29.02 0.898 0.247 775.91
AVG 29.42 0.899 0.247 663.92
BungeeNeRF amsterdam 26.03 0.874 0.170 1458.14
bilbao 26.35 0.864 0.191 1350.37
hollywood 23.44 0.767 0.241 1601.76
pompidou 21.20 0.772 0.266 2169.21
quebec 28.83 0.923 0.156 1468.76
rome 23.34 0.848 0.206 1649.12
AVG 24.87 0.841 0.205 1616.23
Table H: Per-scene results of all evaluated datasets of Scaffold-GS [25].
Datasets Scenes PSNR↑ SSIM↑ LPIPS↓ SIZE↓
Synthetic-NeRF chair 34.96 0.985 0.013 15.50
drums 26.36 0.949 0.045 26.93
ficus 34.66 0.984 0.015 16.46
hotdog 37.82 0.984 0.022 11.31
lego 35.48 0.981 0.018 19.84
materials 30.37 0.958 0.043 23.12
mic 36.37 0.991 0.008 14.83
ship 31.27 0.896 0.119 26.90
AVG 33.41 0.966 0.035 19.36
Mip-NeRF360 bicycle 24.50 0.705 0.306 248.00
garden 27.17 0.842 0.146 271.00
stump 26.27 0.784 0.284 493.00
room 31.93 0.925 0.202 133.00
counter 29.34 0.914 0.191 194.00
kitchen 31.30 0.928 0.126 173.00
bonsai 32.70 0.946 0.185 258.00
flower 21.14 0.566 0.417 253.00
treehill 23.19 0.642 0.410 262.00
AVG 27.50 0.806 0.252 253.89
Tank&Temples truck 25.77 0.883 0.147 107.00
train 22.15 0.822 0.206 66.00
AVG 23.96 0.853 0.177 86.50
DeepBlending playroom 30.62 0.904 0.258 63.00
drjohnson 29.80 0.907 0.250 69.00
AVG 30.21 0.906 0.254 66.00
BungeeNeRF amsterdam 27.16 0.898 0.188 223.00
bilbao 26.60 0.857 0.257 178.00
hollywood 24.49 0.787 0.318 155.00
pompidou 24.94 0.839 0.271 209.00
quebec 30.28 0.936 0.190 159.00
rome 26.23 0.873 0.225 174.00
AVG 26.62 0.865 0.241 183.00
Table I: Notation Table. With slight abuse of notation, we use $L_{3d}$ and $L_{2d}$ to represent the number of levels of the 3D and 2D parts of the hash grid, respectively.

| Notation | Shape | Definition |
|---|---|---|
| $\bm{x}$ | $\mathbb{R}^{3}$ | A random 3D point |
| $\bm{\mu}$ | $\mathbb{R}^{3}$ | Location of Gaussians in 3DGS [17] |
| $\bm{\Sigma}$ | $\mathbb{R}^{3\times 3}$ | Covariance matrix of Gaussians |
| $\bm{S}$ | $\mathbb{R}^{3\times 3}$ | Scale matrix of Gaussians |
| $\bm{R}$ | $\mathbb{R}^{3\times 3}$ | Rotation matrix of Gaussians |
| $\alpha$ | $\mathbb{R}^{1}$ | Opacity of Gaussians after 2D projection |
| $\bm{c}$ | $\mathbb{R}^{3}$ | View-dependent color of Gaussians |
| $I$ | | Number of Gaussians contributing to the rendering |
| $\bm{C}$ | $\mathbb{R}^{3}$ | The obtained pixel value after rendering |
| $\bm{x}^a$ | $\mathbb{R}^{3}$ | Anchor location |
| $\bm{f}^a$ | $\mathbb{R}^{D^a}$ | Feature of the anchor |
| $\bm{l}$ | $\mathbb{R}^{6}$ | Scaling of the anchor |
| $\bm{o}$ | $\mathbb{R}^{3K}$ | Offsets of the anchor |
| $\mathcal{A}$ | | The set of anchor’s attributes, including $\{\bm{f}^a, \bm{l}, \bm{o}\}$ |
| $D^a$ | | Dimension of $\bm{f}^a$ |
| $K$ | | Number of offsets per anchor |
| $\mathcal{H}$ | | A 3D-2D mixed binary hash grid |
| $T$ | | Table size of the hash grid at each level |
| $L$ | | Number of levels of the hash grid |
| $D^h$ | | Dimension of the vectors of the hash grid |
| $\bm{\theta}$ | $\mathbb{R}^{D^h}$ | A vector of the hash grid |
| $\bm{f}^h$ | $\mathbb{R}^{D^h(L_{3d}+3L_{2d})}$ | Feature obtained by interpolation of $\bm{x}^a$ in $\mathcal{H}$ |
| $\bm{f}$ | $\mathbb{R}^{D}$ | Any of anchor’s attribute vectors, $\bm{f} \in \{\bm{f}^a, \bm{l}, \bm{o}\}$ |
| $\bm{\hat{f}}$ | $\mathbb{R}^{D}$ | Quantized version of $\bm{f}$ |
| $D$ | | Dimension of $\bm{f}$, with $D \in \{D^a, 6, 3K\}$ |
| $\bm{q}$ | $\mathbb{R}^{1}$ | Quantization step of $\bm{f}$ |
| $\bm{r}$ | $\mathbb{R}^{1}$ | Quantization step refinement term |
| $\bm{\mu}$ | $\mathbb{R}^{D}$ | Estimated mean value for distribution modeling |
| $\bm{\sigma}$ | $\mathbb{R}^{D}$ | Estimated standard deviation for distribution modeling |
| $Q_0$ | | Base quantization step, which varies for $\bm{f}^a$, $\bm{l}$, $\bm{o}$ |
| $N$ | | Total number of anchors |
| $h_f$ | | Occurrence frequency of “+1” in $\mathcal{H}$ |
| $M_+$ | | Total number of “+1” in $\mathcal{H}$ |
| $M_-$ | | Total number of “-1” in $\mathcal{H}$ |
| $L_{\text{Scaffold}}$ | | The loss term used in Scaffold-GS [25] |
| $L_{\text{entropy}}$ | | Entropy loss for measuring bits of $\mathcal{A}$ |
| $L_{\text{hash}}$ | | Entropy loss for measuring bits of $\mathcal{H}$ |
| $L_m$ | | Masking loss |
| $Loss$ | | The total loss |
| $\lambda_e$ | | Tradeoff parameter to achieve variable bitrate |
| $\lambda_m$ | | Tradeoff parameter to balance masking ratio |
| ${\rm MLP_q}$ | | The MLP to deduce $\bm{r}$ from $\bm{f}^h$ |
| ${\rm MLP_c}$ | | The MLP to deduce $\bm{\mu}$ and $\bm{\sigma}$ from $\bm{f}^h$ |
| $\phi$ | | Probability density function of Gaussian distribution |
| $\Phi$ | | Cumulative distribution function of Gaussian distribution |
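To show how several of the symbols above fit together, the sketch below estimates bit costs in the way such notation is commonly used in learned compression. It is an illustrative assumption rather than a restatement of our equations: the probability of a quantized value $\hat{f}$ is taken as the Gaussian mass over one quantization step, $\Phi\big((\hat{f} + q/2 - \mu)/\sigma\big) - \Phi\big((\hat{f} - q/2 - \mu)/\sigma\big)$, and the binary hash grid is costed by the entropy of its “+1” frequency $h_f = M_+ / (M_+ + M_-)$. All function names are hypothetical.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Phi: cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def quantized_value_bits(f_hat, mu, sigma, q, eps=1e-9):
    """Estimated bits for one quantized value: -log2 of the Gaussian
    probability mass inside [f_hat - q/2, f_hat + q/2] (assumed form)."""
    p = gaussian_cdf(f_hat + q / 2, mu, sigma) - gaussian_cdf(f_hat - q / 2, mu, sigma)
    return -math.log2(max(p, eps))

def hash_grid_bits(m_plus, m_minus):
    """Estimated bits for a binary grid with m_plus '+1' and m_minus '-1'
    entries, using the frequency h_f of '+1' (assumed form)."""
    total = m_plus + m_minus
    if total == 0:
        return 0.0
    h_f = m_plus / total
    bits = 0.0
    if m_plus:
        bits += m_plus * -math.log2(h_f)
    if m_minus:
        bits += m_minus * -math.log2(1.0 - h_f)
    return bits

# A value at the predicted mean is cheaper than one three sigmas away.
near = quantized_value_bits(f_hat=0.0, mu=0.0, sigma=1.0, q=0.5)
far = quantized_value_bits(f_hat=3.0, mu=0.0, sigma=1.0, q=0.5)
print(near < far, hash_grid_bits(50, 50))
```

Under these assumptions, a more accurate prediction of $\bm{\mu}$ and a smaller $\bm{\sigma}$ concentrate probability mass on the true value and lower its estimated bit cost, while a balanced binary grid costs one bit per entry.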