License: arXiv.org perpetual non-exclusive license
arXiv:2403.06912v3 [cs.CV] 24 Mar 2024

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields
with Global-Local Depth Normalization

Jiahe Li¹, Jiawei Zhang¹, Xiao Bai¹, ** Zheng¹, Xin Ning², Jun Zhou³, Lin Gu⁴,⁵
¹School of Computer Science and Engineering, State Key Laboratory of
Complex & Critical Software Environment, Jiangxi Research Institute, Beihang University
²Institute of Semiconductors, Chinese Academy of Sciences
³School of Information and Communication Technology, Griffith University
⁴RIKEN AIP  ⁵The University of Tokyo
Corresponding author: Xiao Bai ([email protected]).
Abstract

Radiance fields have demonstrated impressive performance in synthesizing novel views from sparse input views, yet prevailing methods suffer from high training costs and slow inference speed. This paper introduces DNGaussian, a depth-regularized framework based on 3D Gaussian radiance fields, offering real-time and high-quality few-shot novel view synthesis at low cost. Our motivation stems from the highly efficient representation and surprising quality of the recent 3D Gaussian Splatting, even though it suffers from geometry degradation when the number of input views decreases. In Gaussian radiance fields, we find that this degradation in scene geometry is primarily linked to the positioning of Gaussian primitives and can be mitigated by a depth constraint. Consequently, we propose a Hard and Soft Depth Regularization to restore accurate scene geometry under coarse monocular depth supervision while maintaining a fine-grained color appearance. To further refine detailed geometry reshaping, we introduce Global-Local Depth Normalization, enhancing the focus on small local depth changes. Extensive experiments on LLFF, DTU, and Blender datasets demonstrate that DNGaussian outperforms state-of-the-art methods, achieving comparable or better results with significantly reduced memory cost, a $25\times$ reduction in training time, and over $3000\times$ faster rendering speed. Code is available at: https://github.com/Fictionarry/DNGaussian .

Figure 1: Comparison of the state-of-the-art methods FreeNeRF [53] and SparseNeRF [42] with our DNGaussian, using three views for training. DNGaussian stands out by delivering comparably high-quality synthesized views and superior details with a remarkable 25× reduction in time and significantly lower memory overhead during training, while attaining the fastest and the only real-time rendering speed of 300 FPS. The point cloud of Gaussians illustrates the detailed and explainable spatial representation learned through our method.

1 Introduction

Novel view synthesis with sparse inputs poses a challenge for radiance fields. Recent advances in neural radiance fields (NeRF) have excelled in reconstructing photorealistic appearance and accurate geometry from just a handful of input views [27, 53, 42, 35, 49, 36, 5, 55, 16]. However, most sparse-view NeRFs are implemented with low processing speed and substantial memory consumption, resulting in high time and computational costs that restrict their practical applications. While some methods [36, 38, 49] achieve faster inference speed with grid-based backbones [26, 37, 13], they often suffer from trade-offs, leading to either high training costs or compromised rendering quality.

Recently, 3D Gaussian Splatting [18] has introduced an unstructured 3D Gaussian radiance field, employing a set of 3D Gaussian primitives to achieve remarkable success in rapid, high-quality, and low-cost novel view synthesis when learned from dense color input views. Even with only sparse inputs, it partially retains a surprising ability to reconstruct some clear and detailed local features. Nevertheless, the decrease in view constraints causes a significant portion of the scene geometry to be learned incorrectly, resulting in failures in novel view synthesis, as illustrated in Figure 2. Inspired by the success of earlier depth-regularized sparse-view NeRFs [42, 36], this paper explores distilling depth information from pre-trained monocular depth estimators to rectify the ill-learned geometry of the Gaussian fields, and introduces the Depth Normalization Regularized Sparse-view 3D Gaussian Radiance Fields (DNGaussian) to pursue higher quality and efficiency for few-shot novel view synthesis.


Figure 2: 3D Gaussian Splatting [18] exhibits its potential to reconstruct some fine details (green box) from sparse input views. Nevertheless, the reduced input views would significantly degrade geometry and cause failed reconstruction (orange box). After applying depth regularization, DNGaussian successfully recovers accurate geometry and synthesizes high-quality novel views.

Despite sharing a similar form of depth rendering, the depth regularization for 3D Gaussian radiance fields differs significantly from that employed by NeRF. Firstly, existing depth regularization strategies for NeRFs commonly employ depth to regularize the entire model, which creates a potential geometry conflict in the Gaussian fields that adversely affects quality. Specifically, this practice forces the shape of Gaussians to fit the smooth monocular depth rather than the complex color appearance and thus results in loss of details and blurred appearance. Considering that the basis of scene geometry lies in the position of the Gaussian primitives rather than their shape, we freeze the shape parameters and propose a Hard and Soft Depth Regularization to enable spatial reshaping by encouraging movement among the primitives. During regularization, we propose rendering two types of depth to independently adjust the center and opacity of Gaussians without changing their shape, therefore striking a balance between the fitting of complex color appearance and smooth coarse depth.

Moreover, Gaussian radiance fields are more sensitive to small depth errors than NeRF, which may result in a noisy distribution of primitives and failures in regions with complex textures. Existing scale-invariant depth losses often align depth maps to a fixed scale, which causes small losses to be overlooked. To address this issue, we introduce the Global-Local Depth Normalization into the depth loss function, thus encouraging the learning of small local depth changes in a scale-invariant way. With the local and global scale normalization, our method guides the loss function to refocus on small local errors while maintaining knowledge of the absolute scale, enhancing the detailed geometry reshaping process for depth regularization.

Integrating the two proposed techniques, DNGaussian synthesizes views with competitive quality and superior details compared to state-of-the-art methods in multiple sparse-view settings on LLFF, Blender, and DTU datasets. This advantage is further enriched by substantially lower memory costs, a $25\times$ reduction in training time, and over $3000\times$ faster rendering speed. The experiments also demonstrate our method’s universal ability to fit complex scenes, wide-ranging views, and multiple materials.

Our main contributions are the following:

  • A Hard and Soft Depth Regularization to constrain the geometry of 3D Gaussian radiance fields by encouraging the movement of Gaussians, which enables space reshaping under coarse depth regularization without compromising fine-grained color performance.

  • A Global-Local Depth Normalization that normalizes depth patches on local scales to refocus on small local depth changes, thereby improving the reconstruction of detailed appearance for 3D Gaussian radiance fields.

  • A DNGaussian framework for fast and high-quality few-shot novel view synthesis, which combines the above two techniques and achieves competitive quality across multiple benchmarks compared to the state-of-the-art methods, excelling in capturing details with significantly lower training costs and real-time rendering.

To the best of our knowledge, this is the first attempt to analyze and address the depth regularization problem for 3D Gaussian Splatting under coarse depth cues. We hope this paper can inspire more ideas for optimizing radiance fields in under-constrained situations.

2 Related Work

Radiance Fields for Novel View Synthesis. Novel view synthesis aims to generate unseen views of the same object or scene from a set of given images [60, 1]. Neural Radiance Fields (NeRF) [25] uses a large MLP to represent 3D scenes and renders via volume rendering. However, it is slow in both training and inference. Subsequent improvements mainly pursue either higher quality [2, 3] or efficiency [6, 26, 37, 12, 54, 21, 15], but rarely achieve both. The most recent unstructured radiance fields [7, 52, 18] utilize a set of primitives to represent scenes. Among them, 3D Gaussian Splatting [18] represents radiance fields by a set of anisotropic 3D Gaussians and renders with differentiable splatting. This approach achieves great success in fast and high-quality reconstruction for complex real scenes. While this method excels with dense input views and has achieved success in various 3D tasks [23, 47, 39], its reconstruction with sparse view inputs remains an open problem. Also, issues such as how to apply additional constraints for improvement are still unsolved and worthy of discussion.

Few-shot Novel View Synthesis. Few-shot novel view synthesis aims to generate novel views from only a set of sparse input views. Many works address the problem by introducing regularization strategies specific to NeRF [53, 27, 19, 11]. Some pre-trained methods aim to design a generative model and train it on large datasets [5, 55, 9, 62, 20], while others [49, 16] take pre-trained models as a type of loss to regularize the training process with well-learned knowledge. Depth distilling [11, 31, 36, 42] is also a powerful technique for sparse-view neural fields. However, limited by their powerful but slow backbones or the complex pre-trained models, most of these methods are costly in both training and inference. Although some methods [36, 38, 49] have improved inference efficiency via grid-based backbones, they also suffer from trade-offs like higher training costs or lower quality. More recently, some works [22, 32, 28] enable zero-shot novel view synthesis from even a single input by using diffusion model priors, but they can hardly handle complex scenes and are less efficient.

Depth Supervision in Sparse-view Neural Fields. As a classic cue in many 3D vision tasks [46, 44, 41, 58, 43, 61], depth information has been widely used to supervise sparse-view neural fields. The first group [11, 31] extracts accurate but sparse depth values from reliable point clouds, and the second [56, 14, 36, 40, 42] distills depth knowledge from current powerful monocular depth estimators [30, 29]. Considering that point clouds are sparse and not available in many sparse-view cases, monocular depth shows its advantage in density and robustness for our tasks. To tackle the scale ambiguity and error of monocular depths, some previous works and concurrent sparse-view 3DGS methods have introduced various scale-invariant losses [56, 10, 36, 50, 63], including depth ranking loss [51, 42]; however, none of these is optimal for our setting. First, flexible Gaussians are more sensitive to wrong depth cues, requiring extra designs for regularization. Second, these losses align the depth to a certain fixed global scale, which may ignore minor local depth changes. This oversight can lead to a noisy primitive distribution, particularly in regions with intricate textures. Besides, we notice an HDN loss [57] that can preserve details in monocular depth estimation. Nevertheless, it is also unsuitable, as its reliance on multi-scale patches would bring long-distance errors and compromise geometric accuracy.

3 Method


Figure 3: The Framework of DNGaussian. Our framework starts from a random initialization and consists of a Color Supervision module and a Depth Regularization module. The optimization process of Color Supervision mainly inherits from 3D Gaussian Splatting [18] except for a Neural Color Renderer. In the depth regularization, we render a Hard Depth and a Soft Depth for the input view, and separately calculate the losses of the pre-generated monocular depth map with the proposed Global-Local Depth Normalization. Finally, the output Gaussian field enables efficient and high-quality novel view synthesis.

3.1 Preliminary for 3D Gaussian Splatting

Representation. 3D Gaussian Splatting [18] represents 3D information with a set of 3D Gaussians. It computes the pixel-wise color $\mathcal{C}$ with a set of 3D Gaussian primitives $\theta$, view pose $P$, and the camera parameter involving the center $o$.

Specifically, a Gaussian primitive can be described with a center $\mu\in\mathbb{R}^{3}$, a scaling factor $s\in\mathbb{R}^{3}$, and a rotation quaternion $q\in\mathbb{R}^{4}$. The basis function of the $i$-th primitive $\mathcal{G}_{i}$ is in the form of:

$$\mathcal{G}_{i}(x)=e^{-\frac{1}{2}(x-\mu_{i})^{T}\Sigma_{i}^{-1}(x-\mu_{i})}, \tag{1}$$

where the covariance matrix $\Sigma$ can be calculated from the scale $s$ and rotation $q$. For rendering purposes, the Gaussian primitive also retains an opacity value $\alpha\in\mathbb{R}$ and a $K$-dimensional color feature $f\in\mathbb{R}^{K}$. Then $\theta_{i}=\{\mu_{i},s_{i},q_{i},\alpha_{i},f_{i}\}$ denotes the parameters of the $i$-th Gaussian.
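
As an illustration of Eq. (1), the following minimal NumPy sketch evaluates the basis function of a single primitive. It assumes the construction $\Sigma = R S S^{T} R^{T}$ commonly used in 3D Gaussian Splatting, with $S=\mathrm{diag}(s)$ and $R$ the rotation matrix of $q$; the text above only states that $\Sigma$ is calculated from $s$ and $q$. The quaternion ordering $(w, x, y, z)$ and all function names are assumptions made for this example.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a proper rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(s, q):
    """Assumed parameterization: Sigma = R S S^T R^T with S = diag(s)."""
    M = quat_to_rotmat(q) @ np.diag(s)
    return M @ M.T

def gaussian_basis(x, mu, s, q):
    """Eq. (1): exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(covariance(s, q), d)))

mu = np.zeros(3)
s = np.array([1.0, 2.0, 0.5])
q = np.array([1.0, 0.0, 0.0, 0.0])  # identity rotation

value_at_center = gaussian_basis(mu, mu, s, q)
value_off_center = gaussian_basis(np.array([1.0, 0.0, 0.0]), mu, s, q)
```

With the identity rotation, $\Sigma$ reduces to $\mathrm{diag}(s^{2})$, so the basis function equals 1 at the center and $e^{-1/2}$ one standard deviation away along the first axis.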

Rendering. 3D Gaussian Splatting utilizes point-based rendering to compute the color $\mathcal{C}$ of pixel $x_{p}$ by blending $N$ ordered Gaussians overlapping the pixel:

$$\mathcal{C}(x_{p})=\sum_{i\in N}{c_{i}\widetilde{\alpha}_{i}\prod_{j=1}^{i-1}(1-\widetilde{\alpha}_{j})}, \tag{2}$$

where $c_{i}$ is the decoded color of feature $f$.

Different from NeRF’s ray sampling strategy, the involved $N$ Gaussians are gathered by a well-optimized rasterizer according to $x_{p}$, the camera parameter, the view pose $P$, and a set of pre-defined rules. The rendering opacity $\widetilde{\alpha}$ of the $N$ primitives is calculated from $\alpha$ and their projected 2D Gaussians $\mathcal{G}^{proj}$ on the image plane:

$$\widetilde{\alpha}_{i}=\alpha_{i}\mathcal{G}^{proj}_{i}(x_{p}). \tag{3}$$

Then, similar to NeRF, we can represent the pixel-wise depth $\mathcal{D}$ with the distance to the camera center $o$:

$$\mathcal{D}(x_{p})=\sum_{i\in N}{||\mu_{i}-o||_{2}}\times\widetilde{\alpha}_{i}\prod_{j=1}^{i-1}(1-\widetilde{\alpha}_{j}). \tag{4}$$
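
Eqs. (2)-(4) share the same blending weights, which a short single-pixel sketch can make concrete. The code below is a plain NumPy illustration rather than the rasterizer used by 3D Gaussian Splatting: it assumes the $N$ Gaussians are already sorted from near to far and that the projected values $\mathcal{G}^{proj}_{i}(x_{p})$ are given. The function name and the toy values are hypothetical.

```python
import numpy as np

def blend_pixel(colors, dists, alpha, g_proj):
    """Blend N near-to-far Gaussians covering one pixel.

    colors: (N, 3) decoded colors c_i
    dists:  (N,) distances ||mu_i - o||_2
    alpha:  (N,) opacities alpha_i
    g_proj: (N,) projected 2D Gaussian values at the pixel
    """
    alpha_tilde = alpha * g_proj  # Eq. (3)
    # Transmittance in front of Gaussian i: prod_{j<i} (1 - alpha_tilde_j)
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alpha_tilde)[:-1]))
    weights = alpha_tilde * transmittance
    color = weights @ colors  # Eq. (2)
    depth = weights @ dists   # Eq. (4)
    return color, depth, weights

colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
dists = np.array([1.0, 2.0, 4.0])
alpha = np.array([0.5, 0.5, 1.0])
g_proj = np.ones(3)  # pixel assumed to lie at each projected center

color, depth, weights = blend_pixel(colors, dists, alpha, g_proj)
```

For these values the weights are 0.5, 0.25, and 0.25, so the rendered depth is 2.0: a mixture of all three Gaussians rather than the distance of any single one.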

Optimization. 3D Gaussian Splatting optimizes the parameters $\theta$ for all Gaussians through gradient descent under color supervision. During the optimization process, it identifies and duplicates the most active primitives to better represent intricate textures, simultaneously removing redundant primitives. In this work, we inherit these optimization strategies for color supervision.

Initialization. To start from a better geometry, the method suggests utilizing the point cloud from COLMAP [34, 33] or other SfMs to initialize the Gaussians. Instead, considering the instability of point clouds in sparse-view situations, we initialize our method with a random set of Gaussians.

3.2 Depth Regularization for Gaussians

Despite sharing a similar depth computation, existing depth regularization for NeRFs cannot be transferred directly to 3D Gaussian radiance fields because the two representations differ substantially. First, a target conflict between color and depth would occur in the extra parameters. Second, previous regularization for the continuous NeRF focuses only on density, so it hardly works well on the discrete and flexible Gaussian primitives.

Shape Freezing. 3D Gaussian radiance fields possess four optimizable parameters $\{\mu,s,q,\alpha\}$ that can directly influence the depth, which makes them more complex than NeRF. Since the monocular depth is much smoother and easier to fit than color, applying an all-parameter depth regularization to the entire model, as is widely done in previous sparse-view neural fields [56, 42, 14, 10, 51], would lead the shape parameters to overfit the target depth map and cause a blurry appearance. Thus, these parameters must be treated differently. Since the scene geometry is mainly represented by the position distribution of Gaussian primitives, we regard the center $\mu$ and opacity $\alpha$ as the most important parameters to regularize, for they respectively stand for the position itself and the occupancy of a position. Furthermore, to reduce the negative influence on color reconstruction, we freeze the scaling $s$ and rotation $q$ in the depth regularization.

Hard Depth Regularization. To achieve the spatial reshaping of the Gaussian fields, we first propose a Hard Depth Regularization that encourages the movement of the nearest Gaussians, which are expected to compose surfaces but often cause noises and artifacts. Considering the predicted depth is rendered with the mixture of multiple Gaussians and reweighted by the cumulative product $\widetilde{\alpha}$, we manually apply a large opacity value $\tau$ to all Gaussians. Then, we render a “hard depth” that mainly consists of the nearest Gaussians on the ray shot from camera center $o$ and across the pixel $x_{p}$:

$$\mathcal{D}_{hard}(x_{p})=\sum_{i\in N}{\tau(1-\tau)^{i-1}\mathcal{G}^{proj}_{i}(x_{p})||\mu_{i}-o||_{2}}. \tag{5}$$

Since only the center $\mu$ is now being optimized, Gaussians at wrong positions cannot evade the regularization by decreasing their opacity or changing their shapes, and thus their centers $\mu$ move. The regularization is implemented by a similarity loss over the target image area $\mathcal{P}$ that encourages the hard depth $\mathcal{D}_{hard}$ to be close to the monocular depth $\widetilde{\mathcal{D}}$:

$$\mathcal{R}_{hard}(\mathcal{P})=\mathcal{L}_{similar}(\mathcal{D}_{hard}(\mathcal{P}),\widetilde{\mathcal{D}}(\mathcal{P})). \tag{6}$$

Soft Depth Regularization. Regularizing only the “hard depth” is insufficient due to the absence of opacity optimization. We also need to ensure the accuracy of the actually rendered “soft depth”; otherwise, the surface may become semitransparent and cause hollowness. From this perspective, we additionally freeze the Gaussian center $\mu$ (denoted by $\check{\mu}$) to avoid the negative influence caused by the center moving, and propose Soft Depth Regularization for the tuning of the opacity $\alpha$:

$$\begin{aligned}\mathcal{D}_{soft}(x_{p})&=\sum_{i\in N}{||\check{\mu_{i}}-o||_{2}}\times\widetilde{\alpha}_{i}\prod_{j=1}^{i-1}(1-\widetilde{\alpha}_{j}),\\ \mathcal{R}_{soft}(\mathcal{P})&=\mathcal{L}_{similar}(\mathcal{D}_{soft}(\mathcal{P}),\widetilde{\mathcal{D}}(\mathcal{P})).\end{aligned} \tag{7}$$

With both the Hard and Soft Depth Regularization, we constrain the nearest Gaussians to stay in a suitable position with high opacity, therefore composing complete surfaces.
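
The difference between the two rendered depths can be illustrated for a single pixel. The following is a minimal NumPy sketch, not the authors' rasterizer: it assumes the Gaussians are already sorted from near to far, takes the projected values $\mathcal{G}^{proj}_{i}(x_{p})$ as given, and uses an illustrative $\tau=0.95$ (the text only says that $\tau$ is large). The function names and toy values are hypothetical.

```python
import numpy as np

def hard_depth(dists, g_proj, tau=0.95):
    """Eq. (5): every Gaussian shares the same large opacity tau."""
    order = np.arange(len(dists))  # zero-based, so (1 - tau)**order is (1 - tau)^(i-1)
    weights = tau * (1.0 - tau) ** order * g_proj
    return float(weights @ dists)

def soft_depth(dists, g_proj, alpha):
    """Eq. (7): standard blending with the learned opacities alpha."""
    alpha_tilde = alpha * g_proj  # Eq. (3)
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alpha_tilde)[:-1]))
    return float((alpha_tilde * transmittance) @ dists)

dists = np.array([1.0, 2.0, 4.0])   # distances ||mu_i - o||_2, near to far
g_proj = np.ones(3)                 # pixel assumed to lie at each projected center
alpha = np.array([0.1, 0.1, 1.0])   # two faint floaters in front of an opaque surface

d_hard = hard_depth(dists, g_proj)
d_soft = soft_depth(dists, g_proj, alpha)
```

With these toy values the hard depth (about 1.05) stays close to the nearest Gaussian, while the soft depth (3.52) is dominated by the opaque Gaussian at distance 4. In an automatic-differentiation framework, the freezing described above would correspond to letting gradients of the hard depth reach only $\mu$ and gradients of the soft depth reach only $\alpha$.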

3.3 Global-Local Depth Normalization

Previous depth-supervised neural fields usually build the depth loss on the source scales of the depth maps [56, 36, 10, 14, 42]. This type of alignment measures all losses via a fixed scale based on the statistics of a large area. As a result, small errors may be overlooked, particularly when dealing with multiple objectives such as color reconstruction or a wide range of depth variance. This oversight may matter little in previous NeRF-based works, but it can cause more severe problems in Gaussian radiance fields.

In Gaussian radiance fields, correcting small depth errors is more challenging because it primarily relies on the movement of Gaussian primitives, a process that happens with a small learning rate. Also, if the positions of the primitives have not been corrected during depth regularization, they become floating noise and cause failures, especially in regions with detailed appearance where numerous primitives gather, as shown in Figure 4.

Local Depth Normalization. To solve this problem, we make the loss function refocus on small errors by introducing a patch-wise local normalization. Specifically, we cut a whole depth map into small patches and normalize each patch $\mathcal{P}$ of the predicted depth and the monocular depth to a mean of $0$ and a standard deviation close to $1$:

$$\mathcal{D}^{LN}(x)=\frac{\mathcal{D}(x)-\text{mean}(\mathcal{D}(\mathcal{P}))}{\text{std}(\mathcal{D}(\mathcal{P}))+\epsilon},\quad\mathrm{s.t.}\ \ x\in\mathcal{P}, \tag{8}$$

where $\epsilon$ is a small value for numerical stability. After this step, all patches are normalized on a local scale and the loss can be calculated within each patch. We then apply the Local Depth Normalization to the Hard and Soft Depth Regularization to help with geometry reshaping.
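
Eq. (8) can be sketched in a few lines of NumPy. The example below assumes non-overlapping square patches whose size divides the image size, and an illustrative $\epsilon = 10^{-6}$; the patch layout, the patch size, and the function names are assumptions for this sketch, not details given in the text.

```python
import numpy as np

def to_patches(depth, p):
    """Split an (H, W) map into non-overlapping p x p patches: (H//p, W//p, p, p)."""
    H, W = depth.shape
    assert H % p == 0 and W % p == 0, "patch size must divide the image size"
    return depth.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)

def local_normalize(depth, p, eps=1e-6):
    """Eq. (8): subtract the patch mean and divide by the patch std plus eps."""
    patches = to_patches(depth, p)
    mean = patches.mean(axis=(2, 3), keepdims=True)
    std = patches.std(axis=(2, 3), keepdims=True)
    return (patches - mean) / (std + eps)

rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 5.0, size=(8, 8))

ln = local_normalize(depth, p=4)
ln_rescaled = local_normalize(3.0 * depth + 2.0, p=4)  # same map under another scale and shift
```

Because each patch is standardized with its own statistics, `ln` and `ln_rescaled` agree up to the small effect of `eps`. This is what makes the loss insensitive to the unknown scale and shift of the monocular depth while giving small local variations a unit-scale magnitude.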

Global Depth Normalization. In contrast to focusing on small local losses, we also need a global view to learn an overall shape. To compensate for the lack of global scale, we further add a Global Depth Normalization to the depth regularization. This makes the depth loss aware of the global scale while preserving local relevance. Similar to the local one, we apply a patch-wise normalization to free the depth from the source scale and focus on local changes. The only difference is that here we use the global standard deviation of the whole image depth $\mathcal{D}_{\mathcal{I}}$ of image $\mathcal{I}$ to replace that of the patch:

$$\mathcal{D}^{GN}(x)=\frac{\mathcal{D}(x)-\text{mean}(\mathcal{D}(\mathcal{P}))}{\text{std}(\mathcal{D}_{\mathcal{I}})},\quad\mathrm{s.t.}\ \ x\in\mathcal{P},\ \mathcal{P}\subseteq\mathcal{I}. \tag{9}$$

In addition, our patch-wise normalization can also avoid long-distance errors in the monocular depth by driving the learning of locally relative depth, which has an effect similar to depth rank distillation [42, 51]. In contrast to rank distillation, however, for geometry reshaping purposes we also encourage the model to learn the absolute depth change rather than ignoring it.
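
To illustrate how the global normalization of Eq. (9) keeps the absolute size of depth changes, the NumPy sketch below normalizes a toy depth map made of one nearly flat patch and one strongly varying patch. Non-overlapping square patches and the function name are assumptions for this example.

```python
import numpy as np

def global_normalize(depth, p):
    """Eq. (9): subtract the patch mean, divide by the std of the whole image."""
    H, W = depth.shape
    patches = depth.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
    mean = patches.mean(axis=(2, 3), keepdims=True)  # per-patch mean
    return (patches - mean) / depth.std()            # image-wide standard deviation

ramp = np.arange(16, dtype=float).reshape(4, 4)
# Left patch: tiny depth variation; right patch: variation 100 times larger.
depth = np.hstack([1.0 + 0.01 * ramp, 5.0 + 1.0 * ramp])

gn = global_normalize(depth, p=4)
ratio = gn[0, 0].std() / gn[0, 1].std()
```

Here `ratio` equals 0.01: after global normalization the flat patch still varies 100 times less than the other one, whereas dividing by each patch's own standard deviation would bring both to roughly unit scale. The long-distance offset between the two patches (depth near 1 versus near 5) is removed by the per-patch mean in both cases.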

Figure 4: A fixed global scale pays little attention to the small depth errors even under L1 loss, which leads to noisy primitives and causes failures in novel view (yellow box). Our Global-Local Depth Normalization refocuses on small errors via local scale and helps reconstruct a more accurate appearance (green box).

3.4 Training Details

Loss Function. The loss function consists of three parts: the color reconstruction loss $\mathcal{L}_{color}$, the hard depth regularization $\mathcal{R}_{hard}$, and the soft depth regularization $\mathcal{R}_{soft}$. Following 3D Gaussian Splatting, the color reconstruction loss is a combination of an L1 reconstruction loss and a D-SSIM term between the rendered image $\hat{\mathcal{I}}$ and the ground truth $\mathcal{I}$:

$$\mathcal{L}_{color}=\mathcal{L}_{1}(\hat{\mathcal{I}},\mathcal{I})+\lambda\mathcal{L}_{\mathrm{D\text{-}SSIM}}(\hat{\mathcal{I}},\mathcal{I}). \tag{10}$$

The depth regularizations $\mathcal{R}_{hard}$ and $\mathcal{R}_{soft}$ each include a local and a global term, which come from our two kinds of depth normalization. We use the L2 loss to measure similarity. Both regularizations take the form:

$$\mathcal{R}_{T}=\mathcal{L}_{2}(\mathcal{D}_{T}^{GN},\widetilde{\mathcal{D}}^{GN})+\gamma\mathcal{L}_{2}(\mathcal{D}_{T}^{LN},\widetilde{\mathcal{D}}^{LN}), \tag{11}$$

where $T$ stands for $hard$ or $soft$. In practice, we reserve an error tolerance for the L2 loss to relax the constraint. The full loss function is formulated as:

$$\mathcal{L}=\mathcal{L}_{color}+\mathcal{R}_{hard}+\mathcal{R}_{soft}. \tag{12}$$
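To make the structure of Eq. 11 and Eq. 12 concrete, the following is a minimal Python sketch rather than the released implementation. It assumes the rendered and monocular depths have already been normalized globally (GN) and locally (LN) and flattened into equal-length sequences. The function names, the mean-squared form of the L2 term, and the inclusive 1,000-iteration boundary for the soft term (taken from the implementation details in Sec. 4.1) are assumptions, and the error tolerance mentioned above is omitted.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences (assumed L2 form)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def depth_regularization(d_gn, mono_gn, d_ln, mono_ln, gamma=0.1):
    """Eq. 11: global-normalized term plus gamma times the local-normalized term."""
    return mse(d_gn, mono_gn) + gamma * mse(d_ln, mono_ln)

def total_loss(color_loss, r_hard, r_soft, iteration, soft_start=1000):
    """Eq. 12, with the soft term switched on only from `soft_start` (assumption)."""
    soft = r_soft if iteration >= soft_start else 0.0
    return color_loss + r_hard + soft

# Toy example with two "pixels" per depth map.
r_hard = depth_regularization([0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0])
print(r_hard)                                    # global 0.5 + 0.1 * local 1.0
print(total_loss(1.0, r_hard, 0.25, iteration=500))
print(total_loss(1.0, r_hard, 0.25, iteration=1000))
```

With $\gamma=0.1$, the local term acts as a small correction on top of the global term, and before the soft term is enabled the total loss reduces to the color loss plus the hard regularization.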
| Method | Setting | LLFF PSNR ↑ | LLFF LPIPS ↓ | LLFF SSIM ↑ | LLFF AVGE ↓ | DTU PSNR ↑ | DTU LPIPS ↓ | DTU SSIM ↑ | DTU AVGE ↓ |
|---|---|---|---|---|---|---|---|---|---|
| SRF [8] | Trained on DTU | 12.34 | 0.591 | 0.250 | 0.313 | 15.32 | 0.304 | 0.671 | 0.171 |
| PixelNeRF [55] | Trained on DTU | 7.93 | 0.682 | 0.272 | 0.461 | 16.82 | 0.270 | 0.695 | 0.147 |
| MVSNeRF [5] | Trained on DTU | 17.25 | 0.356 | 0.557 | 0.171 | 18.63 | 0.197 | 0.769 | 0.113 |
| SRF ft [8] | Trained on DTU, Fine-tuned per Scene | 17.07 | 0.529 | 0.436 | 0.203 | 15.68 | 0.281 | 0.698 | 0.162 |
| PixelNeRF ft [55] | Trained on DTU, Fine-tuned per Scene | 16.17 | 0.512 | 0.438 | 0.217 | 18.95 | 0.269 | 0.710 | 0.125 |
| MVSNeRF ft [5] | Trained on DTU, Fine-tuned per Scene | 17.88 | 0.327 | 0.584 | 0.157 | 18.54 | 0.197 | 0.769 | 0.113 |
| Mip-NeRF [2] | Optimized per Scene | 14.62 | 0.495 | 0.351 | 0.246 | 8.68 | 0.353 | 0.571 | 0.323 |
| DietNeRF [16] | Optimized per Scene | 14.94 | 0.496 | 0.370 | 0.240 | 11.85 | 0.314 | 0.633 | 0.243 |
| RegNeRF [27] | Optimized per Scene | 19.08 | 0.336 | 0.587 | 0.149 | 18.89 | 0.190 | 0.745 | 0.112 |
| FreeNeRF [53] | Optimized per Scene | 19.63 | 0.308 | 0.612 | 0.134 | 19.92 | 0.182 | 0.787 | 0.098 |
| SparseNeRF [42] | Optimized per Scene | 19.86 | 0.328 | 0.624 | 0.127 | 19.55 | 0.201 | 0.769 | 0.102 |
| 3DGS [18] | Optimized per Scene | 15.52 | 0.405 | 0.408 | 0.209 | 10.99 | 0.313 | 0.585 | 0.252 |
| 3DGS† | Optimized per Scene | 16.46 | 0.401 | 0.440 | 0.192 | 14.74 | 0.249 | 0.672 | 0.169 |
| DNGaussian (Ours) | Optimized per Scene | 19.12 | 0.294 | 0.591 | 0.132 | 18.91 | 0.176 | 0.790 | 0.102 |

† with the same hyperparameters and the neural color renderer as DNGaussian.
Table 1: Quantitative Comparison on LLFF and DTU for 3 input views. The best, second-best, and third-best entries are marked in red, orange, and yellow, respectively. Notably, the Gaussian-based methods directly show the background color in the meaningless invisible places, which causes lower metrics, especially in PSNR.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| NeRF [25] | 14.934 | 0.687 | 0.318 |
| NeRF (Simplified) [16] | 20.092 | 0.822 | 0.179 |
| DietNeRF [16] | 23.147 | 0.866 | 0.109 |
| DietNeRF + ft [16] | 23.591 | 0.874 | 0.097 |
| FreeNeRF [53] | 24.259 | 0.883 | 0.098 |
| SparseNeRF [42] | 22.410 | 0.861 | 0.119 |
| 3DGS [18] | 22.226 | 0.858 | 0.114 |
| DNGaussian (Ours) | 24.305 | 0.886 | 0.088 |

Table 2: Quantitative Comparison on Blender for 8 input views. The best, second-best, and third-best entries are marked in red, orange, and yellow, respectively.

Neural Color Renderer. 3D Gaussian Splatting stores color via spherical harmonics, which easily overfit when only sparse views are available. To relieve this problem, we use a grid encoder and an MLP as the Neural Color Renderer to predict the color of each primitive (Figure 3). During inference, we store the intermediate result and only compute the last MLP layers, which merge in the view direction, for acceleration.

4 Experiments

4.1 Setups

Datasets. We conduct our experiments on three datasets: the NeRF Blender Synthetic dataset (Blender) [25], the DTU dataset [17], and the LLFF dataset [24]. We follow the setting used in previous works [27, 53, 42] with the same split of DTU and LLFF to train the model on 3 views and test on another set of images. To remove background noise and focus on the target object, we apply the same object masks as previous works [27] for DTU at evaluation. For Blender, we follow DietNeRF [16] and FreeNeRF [53] to train with the same 8 views and test on 25 unseen images. Aligned with the baselines, downsampling rates of 8, 4, and 2 are applied to LLFF, DTU, and Blender, respectively. Following previous sparse-view settings, the camera poses are assumed to be known via calibration or other means.

Evaluation Metrics. We report PSNR, SSIM [45], and LPIPS [59] scores to evaluate the reconstruction performance quantitatively. We also report the Average Error (AVGE) [27], computed as the geometric mean of $\text{MSE}=10^{-\text{PSNR}/10}$, $\sqrt{1-\text{SSIM}}$, and LPIPS.
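As a worked example of the AVGE definition, the Python sketch below computes the geometric mean of the three terms. The function name `average_error` is ours, and plugging in dataset-level averages (rather than averaging per-image values, which the reported numbers may do) is an assumption made for illustration.

```python
import math

def average_error(psnr, ssim, lpips):
    """Geometric mean of MSE = 10^(-PSNR/10), sqrt(1 - SSIM), and LPIPS."""
    mse = 10.0 ** (-psnr / 10.0)
    return (mse * math.sqrt(1.0 - ssim) * lpips) ** (1.0 / 3.0)

# DNGaussian scores on LLFF with 3 input views, taken from Table 1.
avge = average_error(psnr=19.12, ssim=0.591, lpips=0.294)
print(f"AVGE = {avge:.3f}")
```

For these inputs the result rounds to 0.132, matching the AVGE listed for DNGaussian on LLFF in Table 1.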

Baselines. Following previous sparse-view neural fields [27, 16, 53, 42], we take the current SOTA methods SRF [8], PixelNeRF [55], MVSNeRF [5], Mip-NeRF [2], DietNeRF [16], RegNeRF [27], FreeNeRF [53], and SparseNeRF [42] as our baselines. For most NeRF-based methods, we directly report the best quantitative results from the corresponding published papers for comparison. The results of raw 3D Gaussian Splatting (3DGS) [18] are also reported.

Implementations. We build our models on the official PyTorch 3D Gaussian Splatting codebase. We train the model for 6,000 iterations on all datasets, and the soft depth regularization is applied after 1,000 iterations for stability. We set $\gamma=0.1$ and $\tau=0.95$ in the loss functions for all experiments. The neural renderer consists of a hash encoder [26] with 16 levels in a resolution range of 16 to 512 and a max size of $2^{19}$, and a 5-layer MLP with a hidden dim of 64. We use DPT [29] to predict monocular depth maps for all input views. The models of 3DGS and DNGaussian are randomly initialized with a uniform distribution.
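To illustrate what a 16-level hash encoder spanning resolutions 16 to 512 implies, the sketch below lists per-level grid resolutions under the geometric growth rule of Instant-NGP [26]. Whether the encoder used here follows exactly this rule and rounding convention is an assumption, and the helper name `hash_grid_resolutions` is hypothetical.

```python
import math

def hash_grid_resolutions(n_levels=16, n_min=16, n_max=512):
    """Per-level resolutions growing geometrically from n_min to n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (n_levels - 1))
    # The small epsilon guards against floating-point results such as 511.9999...
    return [math.floor(n_min * growth ** level + 1e-9) for level in range(n_levels)]

MAX_TABLE_SIZE = 2 ** 19  # cap on hash table entries per level, as stated in the text

resolutions = hash_grid_resolutions()
print(resolutions)
print(MAX_TABLE_SIZE)
```

Under this rule the per-level growth factor is $32^{1/15}=2^{1/3}\approx 1.26$, so the resolution doubles every three levels.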

4.2 Comparison

Figure 5: Qualitative Comparison on LLFF. 3DGS [18] fails to synthesize accurate novel views under sparse inputs. The rendering views from FreeNeRF [53] and SparseNeRF [42] are both smooth but with too many details lost. FreeNeRF further learns a wrong geometry in complete scenes. Our method learns more complete foreground geometry and renders high-quality novel views with fine details.

LLFF. The quantitative results and visualizations on the LLFF dataset from 3 input views are reported in Table 1 and Figure 5. Notably, the NeRF-based baselines interpolate colors into areas that are invisible from the input views, while the discrete Gaussian radiance fields directly expose the black background in these empty spaces, so the 3DGS-based methods are inherently penalized in the reconstruction metrics on these meaningless invisible areas. Despite that, our approach still outperforms all baselines in the LPIPS score, and achieves comparable PSNR, SSIM, and Average Error to the best methods. From both the quantitative and qualitative results, we can see that our DNGaussian predicts finer details and more precise geometry. FreeNeRF tends to synthesize smooth views that lack high-frequency details, and its geometry is not as accurate as that of the depth-supervised SparseNeRF and our DNGaussian. Although regularized by the same depth maps, SparseNeRF is weaker in details and geometry completeness. DNGaussian also shows large improvements in both image and geometry quality compared to the well-tuned 3DGS.


Figure 6: Qualitative Comparison on DTU. Our method excels both in geometry and rendering qualities in difficult areas.

DTU. The quantitative results on the DTU 3-view setting reported in Table 1 show that our method achieves the best LPIPS and SSIM, and the second-best Average Error. However, it obtains a lower PSNR, which is mainly due to scale variance and the noise occlusion coming from the textureless board and background in the scene. In the qualitative examples in Figure 6, it can be observed that our method learns a more correct and complete geometry than both FreeNeRF and the depth-supervised SparseNeRF, and produces high-quality details even in difficult plush and reflective areas.


Figure 7: Qualitative Comparison on Blender. The results demonstrate our method's strong ability to fit surrounding views and reconstruct detailed and complex scenes.

Blender. To test the fitting ability from surrounding views, we evaluate on the Blender dataset with 8 input views. The scores are reported in Table 2, in which some data come from FreeNeRF [53] and DietNeRF [16]. Our method achieves the best scores in PSNR, SSIM, and LPIPS. From the qualitative results in Figure 7, it can be seen that our method synthesizes views with correct geometry and fewer floaters compared to the vanilla 3DGS, and performs better in detail than the second-best FreeNeRF. The results demonstrate that DNGaussian can handle not only forward-facing scenes like LLFF and DTU, but also the full reconstruction of complex objects with transparent and reflective materials.

Efficiency. We further conduct an efficiency study on the LLFF 3-view setting with RTX 3090 Ti GPUs to explore the performance of current SOTA baselines [42, 53] with limited GPU memory of 24GB/12GB and training time of 1.0h/0.5h, as shown in Table 3. The top row of each group represents the default setting of the corresponding baseline, where the training time is measured by us with the same number of iterations on a single GPU. While both FreeNeRF and SparseNeRF perform worse under strict resource limitations, our method shows large advantages in efficiency, achieving accelerations of $25\times$ in training time and over $3000\times$ in FPS while synthesizing novel views of competitive quality. Given the necessity for per-scene optimization and rapid visualization, our high efficiency holds significant value for practical applications.

| Method | FPS | Time | GPU Mem | PSNR ↑ | LPIPS ↓ | SSIM ↑ |
|---|---|---|---|---|---|---|
| FreeNeRF [53] | $9\times10^{-2}$ | 2.3 h | 4×48 GB | 19.63 | 0.308 | 0.612 |
| | | 2.3 h | 24 GB | 19.71 | 0.322 | 0.603 |
| | | 1.0 h | 24 GB | 19.66 | 0.357 | 0.574 |
| SparseNeRF [42] | $9\times10^{-2}$ | 1.5 h | 32 GB | 19.86 | 0.328 | 0.624 |
| | | 1.5 h | 12 GB | 19.95 | 0.334 | 0.598 |
| | | 0.5 h | 12 GB | 19.94 | 0.341 | 0.585 |
| Ours | 300 | 3.5 min | 2 GB | 19.12 | 0.294 | 0.591 |

Table 3: Efficiency Comparison with Limited Resources. Our method achieves efficient training and the fastest real-time rendering while synthesizing competitive high-quality novel views.
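The speed-up factors quoted in the efficiency study can be checked directly from the default-setting rows of Table 3. The short Python sketch below only performs this arithmetic; the helper name `speedup` is ours.

```python
def speedup(baseline, ours):
    """Ratio of a baseline quantity to ours (both in the same unit)."""
    return baseline / ours

OURS_TRAIN_MIN = 3.5   # minutes, from Table 3
OURS_FPS = 300.0
BASELINE_FPS = 9e-2    # FPS listed for FreeNeRF and SparseNeRF

train_vs_sparsenerf = speedup(1.5 * 60, OURS_TRAIN_MIN)   # 1.5 h baseline
train_vs_freenerf = speedup(2.3 * 60, OURS_TRAIN_MIN)     # 2.3 h baseline
fps_gain = OURS_FPS / BASELINE_FPS

print(f"training: {train_vs_sparsenerf:.1f}x vs SparseNeRF, "
      f"{train_vs_freenerf:.1f}x vs FreeNeRF")
print(f"rendering: {fps_gain:.0f}x")
```

The $25\times$ training figure corresponds to the SparseNeRF default (1.5 h versus 3.5 min), and the rendering ratio of 300 FPS to 0.09 FPS exceeds $3000\times$.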
Regularization Normalization PSNR ↑ LPIPS ↓ SSIM ↑
AP Hard Soft Local Global
18.14 0.354 0.538
17.90 0.351 0.525
18.31 0.339 0.552
18.68 0.331 0.565
18.55 0.324 0.562
19.12 0.294 0.591
Table 4: Ablation Study. We ablate our method on the LLFF 3-view setting. The results show the effect of our contributions.
Figure 8: Visualization Results of Depth Regularization. Our Hard Depth Regularization significantly improves the high-frequency details but causes hollows. This drawback can be solved via the Soft Depth to synthesize fine details. We take the depth from dense views as the ground truth for comparison.

4.3 Ablation Study

We ablate our method on the LLFF 3-view setting. The quantitative results are reported in Tables 4 and 5.

Depth Regularization. We ablate our Hard and Soft Depth Regularization against a plain all-parameter (AP) L2 reconstruction loss term. To better illustrate the effect of our two types of depth separately and exclude the influence of shape freezing, we further visualize a comparison to the situation with only shape freezing in Figure 8. The results show that the plain depth regularization cannot effectively reshape the scene geometry, which supports the necessity of our method. Both the qualitative and quantitative results demonstrate the benefit of our method for geometry quality and high-frequency details.

Global-Local Depth Normalization. From the results, we can observe that adding only a global normalization also helps fitting, which is mainly due to the local patch-wise loss computation. After adding the local normalization, the rendering quality improves, especially in detail. These improvements are much more obvious when applied to our proposed regularization than to the all-parameter regularization, which is unsuitable for the fields. The results are consistent with our design and show the effectiveness of our Global-Local Depth Normalization.

Parameter Freezing. To illustrate the effect of our parameter-freezing strategy, we also ablate the shape freezing in regularization and the center freezing in soft depth calculation. The results are shown in Table 5 and Figure 9. The visualization illustrates the depth-color conflict discussed in Sec. 3.2. Without center freezing, some primitives may move to unexpected places to compensate for the depth loss, which lowers quality. By introducing the proposed parameter freezing, we relieve these problems and achieve the best results.

Figure 9: Visualization Results of Shape Freezing. The synthesized novel view without shape freezing is filled with blurry areas, which is mainly caused by the over-smooth geometry learned from the monocular depth (left bottom).
| Setting | PSNR ↑ | LPIPS ↓ | SSIM ↑ | AVGE ↓ |
|---|---|---|---|---|
| w/o Shape Freezing | 17.96 | 0.363 | 0.547 | 0.160 |
| w/o Center Freezing | 18.87 | 0.300 | 0.584 | 0.140 |
| All | 19.12 | 0.294 | 0.591 | 0.132 |

Table 5: Ablation Study on Parameter Freezing. The results demonstrate the necessity of our parameter freezing strategy.

5 Conclusion

This paper presents the DNGaussian framework that introduces 3DGS into the few-shot novel view synthesis task by depth regularization. Due to the space limitation, we have put more discussions in the supplementary material.

Acknowledgements. This work is supported by the National Natural Science Foundation of China 62276016, 62372029. Lin Gu is supported by JST Moonshot R&D Grant Number JPMJMS2011 Japan.


Supplementary Material

Overview

In the supplemental document, we first report additional studies in Sec. A of our proposed depth normalization, neural color renderer, and the performance of previous methods on fast grid-based backbones. Then, we describe the details of our implementation and dataset settings in our experiment in Sec. B. Finally, we discuss the limitations and future work of our method in Sec. C.

Appendix A Additional Results

A.1 Ablation Study on Depth Normalization

To better illustrate the roles of our Local and Global Depth Normalization, we conduct an additional ablation study in which the L2 loss function is replaced with L1 to avoid its shrinking of small losses. The quantitative and visualization results are shown in Table 6 and Figure 10. In the comparison, we apply the Global and the Local normalization separately to illustrate the strengths and weaknesses of each: 1) Although the global one alone can support the model in learning an overall scene, it is weak in optimizing minor errors, as discussed in Sec. 3.3. 2) The local one cannot stand alone due to the lack of absolute scale, but provides rich information on local depth changes. 3) By combining both techniques, our Global-Local Depth Normalization obtains knowledge of both the global scale and small local errors and achieves the best results. Notably, since a different type of loss is used in this study, the scores differ from those reported in the main paper. Despite this, our method still performs well, particularly in LPIPS and SSIM, which demonstrates the robustness of our depth normalization.
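The reason for switching from L2 to L1 in this study can be seen with scalar errors: squaring shrinks both the loss value and its gradient as the error gets small, whereas L1 keeps a constant gradient magnitude. The Python sketch below is a toy illustration on per-pixel depth residuals and is not taken from the training code.

```python
def l1_loss(err):
    return abs(err)

def l2_loss(err):
    return err ** 2

def l1_grad_magnitude(err):
    # Constant wherever the error is non-zero.
    return 1.0 if err != 0 else 0.0

def l2_grad_magnitude(err):
    # Proportional to the error, so small errors get small updates.
    return 2.0 * abs(err)

for err in (0.5, 0.1, 0.01):
    print(f"err={err}: L1={l1_loss(err):.4f} (grad {l1_grad_magnitude(err):.2f}), "
          f"L2={l2_loss(err):.4f} (grad {l2_grad_magnitude(err):.2f})")
```

For normalized depth residuals below 1, the L2 value and gradient are always smaller than their L1 counterparts, so using L1 isolates how well each normalization exposes small errors.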

| Setting | PSNR ↑ | LPIPS ↓ | SSIM ↑ | AVGE ↓ |
|---|---|---|---|---|
| Only Global | 18.32 | 0.309 | 0.579 | 0.144 |
| Only Local | 17.17 | 0.338 | 0.523 | 0.167 |
| All | 18.67 | 0.291 | 0.595 | 0.137 |

Table 6: Additional Ablation Study on Depth Normalization. By combining the two proposed depth normalizations, our Global-Local Depth Normalization achieves the best quality.

A.2 Ablation Study on Neural Color Renderer

In this work, we replace the spherical harmonics (SH) of 3D Gaussian Splatting with a neural color renderer to represent the direction-variant color. To better illustrate the function of this module, we compare it to the original SH function with different degrees on the LLFF dataset with 3 training views. The results are in Table 7 and Figure 11. The SH function easily overfits in the sparse-view situation and produces some strange colors when the view changes. This may be caused by the independence of each primitive, which leads to a lack of regional consistency. After introducing the neural color renderer, the problem is relieved. By storing the intermediate result and only computing the last two MLP layers, we also maintain a fast rendering speed competitive with SH.
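The caching idea can be sketched as follows: the view-independent part of the MLP is evaluated once per primitive and stored, and only the final two layers, which take the view direction, are re-evaluated per frame. The pure-Python sketch below uses toy layer sizes and random weights. The 3+2 layer split, the concatenation of the view direction, and all function names are assumptions for illustration, not the released architecture.

```python
import random

def linear(weights, bias, x):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return weights, bias

rng = random.Random(0)
FEAT, HIDDEN, DIR = 4, 8, 3  # toy sizes; the paper uses a hidden dim of 64

# Hypothetical split of a 5-layer MLP: 3 view-independent layers, 2 view-dependent.
trunk = [make_layer(rng, FEAT, HIDDEN), make_layer(rng, HIDDEN, HIDDEN),
         make_layer(rng, HIDDEN, HIDDEN)]
head = [make_layer(rng, HIDDEN + DIR, HIDDEN), make_layer(rng, HIDDEN, 3)]

def run_trunk(grid_feature):
    """View-independent layers; the output can be cached per primitive."""
    h = grid_feature
    for weights, bias in trunk:
        h = relu(linear(weights, bias, h))
    return h

def run_head(cached, view_dir):
    """Last two layers; merges the cached feature with the view direction."""
    h = relu(linear(*head[0], cached + view_dir))
    return linear(*head[1], h)  # raw RGB output, final activation omitted

def full_forward(grid_feature, view_dir):
    return run_head(run_trunk(grid_feature), view_dir)

feature = [0.1, -0.2, 0.3, 0.4]      # stands in for the grid-encoder output
cached = run_trunk(feature)          # computed once, reused for every view
print(run_head(cached, [0.0, 0.0, 1.0]))
print(run_head(cached, [1.0, 0.0, 0.0]))
```

Because the cached feature does not depend on the camera, the per-frame cost is reduced to the two small head layers, which is consistent with the FPS parity with SH reported in Table 7.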

Figure 10: Visualization of the two proposed Depth Normalizations. The color and depth map of the input view and synthesized novel view are shown in (a) and (b). The global one provides a global view of the whole scene but is weak in handling small local errors (white box), which causes blurry and wrong appearances (yellow box). In contrast, the local one is more sensitive to local depth changes. By combining both of them, our method can learn a more accurate scene geometry. Zoom in for better visualization.
Figure 11: Visualization of the Neural Color Renderer ablation. In sparse-view situations, SH may produce inconsistent colors in unseen views (yellow arrow) due to overfitting, which can be relieved by a neural color renderer.
| Setting | PSNR ↑ | LPIPS ↓ | SSIM ↑ | AVGE ↓ | FPS |
|---|---|---|---|---|---|
| SH degree=2 | 17.06 | 0.333 | 0.549 | 0.167 | 340 |
| SH degree=3 | 17.11 | 0.328 | 0.560 | 0.164 | 300 |
| Neural Renderer | 19.12 | 0.294 | 0.591 | 0.132 | 300 |

Table 7: Ablation Study on Neural Color Renderer. Our neural color renderer successfully improves the rendering quality while keeping an equally fast inference speed.
| Backbone | Strategy | PSNR ↑ | LPIPS ↓ | SSIM ↑ | Time ↓ | VM Cost ↓ | FPS ↑ |
|---|---|---|---|---|---|---|---|
| Mip-NeRF [2] | None | 14.62 | 0.495 | 0.351 | 2.2h | ≥ 32 GB | 0.09 |
| | FreeNeRF | 19.63 | 0.308 | 0.612 | 2.3h | | |
| | SparseNeRF | 19.86 | 0.328 | 0.624 | 1.5h | | |
| Instant-NGP [26] | None | 17.19 | 0.483 | 0.469 | 3.8min | 3 GB | 3 |
| | FreeNeRF | 15.30 | 0.516 | 0.369 | 4.2min | | |
| | SparseNeRF | 17.19 | 0.478 | 0.476 | 7.5min | | |
| TensoRF [6] | None | 16.16 | 0.454 | 0.443 | 4.1min | 8 GB | 5 |
| | FreeNeRF | 15.78 | 0.466 | 0.430 | 4.5min | | |
| | SparseNeRF | 16.11 | 0.465 | 0.443 | 8.9min | | |
| 3DGS [18] | None | 16.46 | 0.401 | 0.440 | 2.7min | 2 GB | 280 |
| | FreeNeRF | 16.55 | 0.399 | 0.472 | 2.7min | | |
| | SparseNeRF | 16.80 | 0.374 | 0.504 | 2.9min | | |
| 3DGS [18] | Ours | 19.12 | 0.294 | 0.591 | 3.5min | 2 GB | 300 |

Table 8: Comparison of the SOTA strategies FreeNeRF [53] and SparseNeRF [42] with different backbones. The best results overall and for each backbone are marked with bold and underline. Although FreeNeRF and SparseNeRF perform well on the implicit Mip-NeRF [2], they only weakly improve the quality with the current fast backbones Instant-NGP [26] and TensoRF [6], as well as the 3DGS [18] used in our work.
Figure 12: Visualization of the implicit-based SOTA methods SparseNeRF (S) [42] and FreeNeRF (F) [53] when transferred to fast backbones Instant-NGP (iNGP) [26], TensoRF [6], and 3D Gaussian Splatting (3DGS) [18]. The depth-based SparseNeRF causes blurry rendering on all fast backbones, while FreeNeRF has less effect on the improvement of quality. Instead, our DNGaussian achieves the best quality on the most efficient 3DGS backbone, which shows the necessity of our method.

A.3 Transfer of Previous Strategies

In this section, we conduct an experiment to illustrate the necessity of our efficient DNGaussian. Indeed, some existing methods like FreeNeRF [53] and SparseNeRF [42] are low in efficiency mainly due to their backbone rather than the strategy itself. However, they have so far only been proven effective for implicit backbones that are slow and costly. To verify whether they can be directly transferred to faster backbones to achieve higher efficiency, we implement these two methods on two fast grid-based backbones, Instant-NGP [26] and TensoRF [6], as well as on our 3D Gaussian Splatting (3DGS) [18] backbone. We then test these implementations in the LLFF 3-view setting. The results are shown in Table 8 and Figure 12. Additionally, we report the training time (Time), GPU memory cost (VM Cost), and the inference FPS for each item.

Implementation details. We utilize a CUDA-implemented ray marching (https://github.com/ashawkey/torch-ngp) for the two grid-based backbones to achieve faster speed and lower costs. The 3DGS backbone employs the same neural color renderer as our method. We follow the original implementation of SparseNeRF to produce monocular depth maps for all input views and transfer its Local Depth Ranking Distillation to these new backbones with the same hyperparameters. For FreeNeRF, since the three fast backbones do not contain a frequency-based positional encoding, we apply the Frequency Regularization to their grid-based positional encoding as an alternative.

Comparison on grid-based backbones. Although these two methods perform well on their original implicit Mip-NeRF, they are weak on both Instant-NGP and TensoRF. SparseNeRF distills the depth ranking from the monocular depth map for regularization but causes more blur. This may be caused by the stronger spatial memory of the explicit grids, which makes it easier to memorize noise. FreeNeRF performs even worse on both backbones, which may be due to the different representations of positional encoding. On TensoRF, both strategies fail to improve performance. One reason may be that TensoRF directly utilizes explicit grids without a neural decoder to store the density value, which is more difficult to regularize.

Comparison on 3DGS. In the comparison, the 3DGS backbone achieves the best efficiency, with the fastest FPS and lowest training cost. However, neither SparseNeRF nor FreeNeRF can effectively regularize this powerful and efficient backbone. Due to the lack of frequency positional encoding, FreeNeRF serves more like a coarse-to-fine strategy and leads to only a small improvement. From the visualization of SparseNeRF in Figure 12, it can be observed that, in the 3D Gaussian radiance fields of 3DGS, it is insufficient to only keep the depth ranking and wait for the color-supervised optimization process to refine the detailed geometry. Compared with these two methods, our DNGaussian achieves much better quality with only a small increase in training time. With less noise in the learned geometry, our method also achieves a faster inference speed.

Conclusion. The experiments show that the previous methods for implicit backbones cannot be easily transferred to current fast backbones, nor are they suitable for the 3D Gaussian radiance fields. In such a situation, our DNGaussian shows significant value in providing an efficient way to achieve high-quality and low-cost few-shot novel view synthesis.

A.4 Comparison with Grid-based Methods

There are some works [49, 38, 36] that utilize a grid-based backbone to improve the training efficiency. Since DaRF [36] is evaluated on another two datasets with at least 9 input views, while VGOS [38] and DiffusioNeRF [49] use different methods for the measurement of metrics, we do not take them as baselines in the main paper. Here we list the scores of VGOS and DiffusioNeRF in the LLFF 3-view setting in Table 9 for comparison. For VGOS, we only report scores for which the measurement method is definitely the same as in RegNeRF [27] and FreeNeRF [53]. The results of DiffusioNeRF are obtained from its updated paper on arXiv (https://arxiv.longhoe.net/abs/2302.12231). In the comparison, our method outperforms the other two with the best scores in LPIPS, SSIM, and AVGE. Our method also achieves the best efficiency, with much lower cost and faster inference.

A.5 Additional Visualizations

We provide more rendering results from our experiments. The examples on DTU and LLFF with 3 training views are shown in Figures 14 and 15. We also show more qualitative comparisons in the Blender 8-view setting with the SOTA method FreeNeRF [53] in Figure 13. More results can be found in our supplementary video.

Figure 13: Qualitative comparison on the Blender dataset with 8 input views. FreeNeRF [53] learns accurate geometry by masking high-frequency signals but suffers from blurry details as the trade-off. In contrast, our method does not explicitly constrain the learning of high-frequency content. Owing also to the 3D Gaussian neural fields, our method performs better in the fine-grained details.
| Method | PSNR ↑ | LPIPS ↓ | SSIM ↑ | AVGE ↓ |
|---|---|---|---|---|
| VGOS [38] | 19.35 | 0.432 | - | - |
| DiffusioNeRF [49] | 19.79 | 0.338 | 0.568 | 0.136 |
| Ours | 19.12 | 0.294 | 0.591 | 0.132 |

Table 9: Comparison with grid-based few-shot NeRFs on LLFF with 3 training views. Our method outperforms grid-based methods VGOS [38] and DiffusioNeRF [49].
Figure 14: Examples of the synthesized novel view results from DNGaussian with 3 input views on the DTU dataset.
Figure 15: Examples of the synthesized novel view results from DNGaussian with 3 input views on the LLFF dataset.

Appendix B Details

B.1 Implementations

Pre-trained Depth Models. In this work, we use the pre-trained DPT [29, 30] estimator to predict the depth map, which has been widely used in many NeRF-based works [10, 42, 4, 48]. In particular, we use dpt_hybrid_384 for the LLFF dataset and dpt_large_384 for DTU and Blender, as the latter performs better on pure white or black backgrounds. The performance gaps of our method when applying different types of depth models are slight, as shown in Table 10.

| Dataset | Type | PSNR ↑ | LPIPS ↓ | SSIM ↑ | AVGE ↓ |
|---|---|---|---|---|---|
| LLFF | dpt_hybrid_384* | 19.12 | 0.294 | 0.591 | 0.132 |
| LLFF | dpt_large_384 | 19.03 | 0.297 | 0.590 | 0.135 |
| DTU | dpt_hybrid_384 | 18.86 | 0.179 | 0.784 | 0.106 |
| DTU | dpt_large_384* | 18.91 | 0.176 | 0.790 | 0.102 |

Table 10: The influence of different pre-trained depth models. We replace the pre-trained depth model with a different type in our LLFF and DTU settings while keeping the same hyperparameters. The results show that our method is robust to different monocular depth estimators. * denotes the default type of each dataset.

Patch Size. In our implementation, we randomly sample a patch size from a pre-defined range for our patch-based Global-Local Depth Normalization. This range is set to $[5,17]$ for LLFF and Blender, and a larger $[17,51]$ for DTU since the objects are smaller but occupy a large proportion of the image. Due to the flexibility of our normalization, we do not need to separately tune this value for each scene.
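A minimal sketch of this sampling step is given below. The ranges come from the text, while treating both ends as inclusive, drawing uniformly over integers, and the names `PATCH_RANGES` and `sample_patch_size` are assumptions for illustration.

```python
import random

# Patch-size ranges per dataset, as stated in the text (assumed inclusive).
PATCH_RANGES = {"llff": (5, 17), "blender": (5, 17), "dtu": (17, 51)}

def sample_patch_size(dataset, rng=random):
    """Draw one patch side length for the next normalization pass."""
    low, high = PATCH_RANGES[dataset]
    return rng.randint(low, high)

rng = random.Random(0)
print([sample_patch_size("llff", rng) for _ in range(5)])
print([sample_patch_size("dtu", rng) for _ in range(5)])
```

Re-sampling the size at each use means the local statistics are computed over windows of varying extent, which is one way to read why no per-scene tuning of this value is needed.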

Metrics. Following previous methods [27, 53], we utilize the “structural_similarity” API from scikit-image (https://scikit-image.org/docs/stable/api/skimage.metrics.html) to compute the SSIM score, and use the implementation with a pre-trained VGG model to calculate the LPIPS score.

Camera Poses. Following existing works [27, 53, 42, 16], we assume all camera poses are already known. In practice, for LLFF and Blender, we use the given poses from the datasets. For DTU, we use COLMAP [34, 33] to calculate the camera poses according to all given views, and then sample the target sparse views from the results.

B.2 Datasets

LLFF. The LLFF dataset [24] contains 8 forward-facing scenes in total. Following [27, 53, 42], we take every 8-th image as the novel views for testing. The input views are evenly sampled across the remaining views. Images are downsampled $8\times$ to the resolution of $378\times 504$. In practice, we ignore the distortion of the original images.
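The split described above can be sketched in a few lines of Python. Starting the held-out set at index 0 and picking input views by rounding evenly spaced positions mirrors the convention commonly used by the cited baselines, but both details, as well as the function name `llff_split`, are assumptions here.

```python
def llff_split(n_images, n_input_views=3, test_hold=8):
    """Return (input_view_ids, test_view_ids) for one LLFF scene."""
    indices = list(range(n_images))
    test = indices[::test_hold]                      # every 8-th image
    remaining = [i for i in indices if i % test_hold != 0]
    if n_input_views == 1:
        return [remaining[0]], test
    last = len(remaining) - 1
    # Evenly spaced positions across the remaining views.
    picks = [round(k * last / (n_input_views - 1)) for k in range(n_input_views)]
    return [remaining[p] for p in picks], test

train_ids, test_ids = llff_split(n_images=20)
print("input views:", train_ids)
print("test views:", test_ids)
```

For a hypothetical scene with 20 images, this holds out images 0, 8, and 16 for testing and selects the first, middle, and last of the remaining views as the 3 inputs.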

DTU. The DTU dataset [17] consists of 124 object-centric scenes captured by a set of fixed cameras. We follow [27, 53, 42] to evaluate models directly on the 15 scenes with the scan IDs of 8, 21, 30, 31, 34, 38, 40, 41, 45, 55, 63, 82, 103, 110, and 114. In each scan, the images with the IDs 25, 22, and 28 are used as the input views in our 3-view setting. The test set consists of images with IDs of 1, 2, 9, 10, 11, 12, 14, 15, 23, 24, 26, 27, 29, 30, 31, 32, 33, 34, 35, 41, 42, 43, 45, 46 and 47 for evaluation. The images are downsampled $4\times$. In particular, we use the undistorted images from COLMAP to eliminate the negative impact of unerased lens distortion.

Blender. We follow the data split used in [16, 53] for the Blender dataset [11]. The 8 input views are selected from the training images, with IDs 26, 86, 2, 55, 75, 93, 16, 73, and 8. The 25 test views are sampled evenly from the testing images for evaluation. All images are downsampled $2\times$ to $400\times 400$ during the experiment.

Appendix C Discussions and Limitations

Our DNGaussian utilizes coarse monocular depth to regularize the scene geometry in situations with sparse input views, and achieves significant improvement in the appearance quality. However, our method still has the limitations described below. We hope these issues can be solved in future work.

More Input Views. Besides only 3 input views, we have also explored the performance when the number of input views increases to 6 and 9 on the LLFF dataset, as shown in Table 11. In the experiment, it can be observed that as the number of views increases, the performance of the baseline also improves. Our DNGaussian can still improve the quality of the synthesized novel view with 6 input views. However, it does not work well when the number of input views increases to 9, which is nearly enough to provide sufficient color constraints. This may be due to the errors in the depth map that negatively influence the optimization process. The next step of our work can lie in leveraging the uncertainty of the monocular depth to filter out unreliable supervision.

| Views | Method | PSNR ↑ | LPIPS ↓ | SSIM ↑ | AVGE ↓ |
|---|---|---|---|---|---|
| 3 | 3DGS | 15.52 | 0.405 | 0.408 | 0.209 |
| | 3DGS† | 16.46 | 0.401 | 0.440 | 0.192 |
| | DNGaussian | 19.12 | 0.294 | 0.591 | 0.132 |
| 6 | 3DGS | 20.63 | 0.226 | 0.699 | 0.108 |
| | 3DGS† | 21.09 | 0.229 | 0.699 | 0.103 |
| | DNGaussian | 22.18 | 0.198 | 0.755 | 0.088 |
| 9 | 3DGS | 20.44 | 0.230 | 0.697 | 0.108 |
| | 3DGS† | 23.21 | 0.176 | 0.785 | 0.076 |
| | DNGaussian | 23.17 | 0.180 | 0.788 | 0.077 |

Table 11: Comparison with 3, 6, and 9 input views on the LLFF dataset. † denotes 3DGS with the same hyperparameters and the neural color renderer as DNGaussian.

Solid Color Planes. The anisotropic shape of the Gaussian primitive makes it difficult to represent a solid color plane in a situation with sparse input views. First, the primitives are hard to constrain by both color and depth in the region of the plane, which may cause ray-like noise and hollows. Also, since they can freely move to other regions with similar colors, the densification operation can be activated more frequently and generate noise. We hope this can be solved by additional geometry priors.

Specular Regions. Although our method can handle some specular regions by relying on depth supervision, especially from our Local Depth Normalization, the view-inconsistent appearance of these regions remains challenging for 3DGS. Fully solving this problem may require more specialized designs.

Hollows and Cracks. The splatting technique of our Gaussian Splatting [18] backbone directly blends the existing primitives to render pixel-level color without interpolation. However, since not every pixel is covered by the projected primitives, the empty space between two Gaussian primitives can cause hollows and cracks when the camera pose changes. For example, some hollows can be seen in Scan 40 in Figure 14. In this work, we mitigate the problem by paying more attention to high-frequency details, which encourages the densification of primitives to fill these empty areas. In the future, we believe this problem can be fundamentally solved by improving the representation itself.
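
The coverage problem can be illustrated with a one-dimensional toy example. This is not the 3DGS rasterizer: the footprint width `sigma`, the `opacity`, and the 0.05 hole threshold are arbitrary values chosen for illustration. Each pixel accumulates opacity as one minus the product of the per-primitive transmittances, and pixels with almost no accumulated opacity are reported as hollows.

```python
import numpy as np

def accumulated_alpha(pixels, centers, sigma, opacity):
    """Accumulated opacity at each pixel from 1D Gaussian footprints."""
    # Per-primitive alpha at every pixel: shape (num_primitives, num_pixels).
    dist = pixels[None, :] - centers[:, None]
    alpha = opacity * np.exp(-0.5 * (dist / sigma) ** 2)
    # Alpha compositing: total opacity is 1 minus the product of transmittances.
    return 1.0 - np.prod(1.0 - alpha, axis=0)

pixels = np.arange(5, 16, dtype=float)   # pixels between the two outer primitives
sparse = np.array([5.0, 15.0])           # two primitives with a gap in between
dense = np.array([5.0, 10.0, 15.0])      # one extra primitive added in the gap

holes_sparse = pixels[accumulated_alpha(pixels, sparse, 1.5, 0.9) < 0.05]
holes_dense = pixels[accumulated_alpha(pixels, dense, 1.5, 0.9) < 0.05]
print(holes_sparse, holes_dense)
```

With two primitives, the pixels midway between them receive almost no opacity. Adding a primitive in the gap, loosely analogous to the densification described above, leaves no pixel below the threshold.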

References

  • Avidan and Shashua [1997] Shai Avidan and Amnon Shashua. Novel view synthesis in tensor space. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1034–1040. IEEE, 1997.
  • Barron et al. [2021] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
  • Barron et al. [2022] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
  • Bian et al. [2023] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4160–4169, 2023.
  • Chen et al. [2021] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021.
  • Chen et al. [2022] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII, pages 333–350. Springer, 2022.
  • Chen et al. [2023] Zhang Chen, Zhong Li, Liangchen Song, Lele Chen, Jingyi Yu, Junsong Yuan, and Yi Xu. Neurbf: A neural fields representation with adaptive radial basis functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4182–4194, 2023.
  • Chibane et al. [2021] Julian Chibane, Aayush Bansal, Verica Lazova, and Gerard Pons-Moll. Stereo radiance fields (srf): Learning view synthesis for sparse views of novel scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7911–7920, 2021.
  • Cong et al. [2023] Wenyan Cong, Hanxue Liang, Peihao Wang, Zhiwen Fan, Tianlong Chen, Mukund Varma, Yi Wang, and Zhangyang Wang. Enhancing nerf akin to enhancing llms: Generalizable nerf transformer with mixture-of-view-experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3193–3204, 2023.
  • Deng et al. [2023] Congyue Deng, Chiyu Jiang, Charles R Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov, et al. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20637–20647, 2023.
  • Deng et al. [2022] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
  • Fridovich-Keil et al. [2022] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
  • Fridovich-Keil et al. [2023] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.
  • Hu et al. [2023a] Shoukang Hu, Kaichen Zhou, Kaiyu Li, Longhui Yu, Lanqing Hong, Tianyang Hu, Zhenguo Li, Gim Hee Lee, and Ziwei Liu. Consistentnerf: Enhancing neural radiance fields with 3d consistency for sparse view synthesis. arXiv preprint arXiv:2305.11031, 2023a.
  • Hu et al. [2023b] Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, and Yuewen Ma. Tri-miprf: Tri-mip representation for efficient anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19774–19783, 2023b.
  • Jain et al. [2021] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting nerf on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5885–5894, 2021.
  • Jensen et al. [2014] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 406–413, 2014.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 42(4):1–14, 2023.
  • Kim et al. [2022] Mijeong Kim, Seonguk Seo, and Bohyung Han. Infonerf: Ray entropy minimization for few-shot neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12912–12921, 2022.
  • Kulhánek et al. [2022] Jonáš Kulhánek, Erik Derner, Torsten Sattler, and Robert Babuška. Viewformer: Nerf-free neural rendering from few images using transformers. In European Conference on Computer Vision (ECCV), 2022.
  • Liu et al. [2020] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
  • Liu et al. [2023] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.
  • Luiten et al. [2023] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
  • Mildenhall et al. [2019] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
  • Niemeyer et al. [2022] Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5480–5490, 2022.
  • Qian et al. [2023] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
  • Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. ICCV, 2021.
  • Ranftl et al. [2022] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 2022.
  • Roessle et al. [2022] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, Pratul P Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12892–12901, 2022.
  • Sargent et al. [2023] Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, et al. Zeronvs: Zero-shot 360-degree view synthesis from a single real image. arXiv preprint arXiv:2310.17994, 2023.
  • Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Schönberger et al. [2016] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision (ECCV), 2016.
  • Seo et al. [2023] Seunghyeon Seo, Yeonjin Chang, and Nojun Kwak. Flipnerf: Flipped reflection rays for few-shot novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22883–22893, 2023.
  • Song et al. [2023] Jiuhn Song, Seonghoon Park, Honggyu An, Seokju Cho, Min-Seop Kwak, Sungjin Cho, and Seungryong Kim. Därf: Boosting radiance fields from sparse inputs with monocular depth adaptation, 2023.
  • Sun et al. [2022] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5459–5469, 2022.
  • Sun et al. [2023] Jiakai Sun, Zhanjie Zhang, Jiafu Chen, Guangyuan Li, Boyan Ji, Lei Zhao, and Wei Xing. Vgos: Voxel grid optimization for view synthesis from sparse inputs. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pages 1414–1422. International Joint Conferences on Artificial Intelligence Organization, 2023. Main Track.
  • Tang et al. [2023] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.
  • Uy et al. [2023] Mikaela Angelina Uy, Ricardo Martin-Brualla, Leonidas Guibas, and Ke Li. Scade: Nerfs from space carving with ambiguity-aware depth estimates. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16518–16527, 2023.
  • Wang et al. [2022] Chen Wang, Xiang Wang, Jiawei Zhang, Liang Zhang, Xiao Bai, Xin Ning, Jun Zhou, and Edwin Hancock. Uncertainty estimation for stereo matching based on evidential deep learning. Pattern Recognition, 124:108498, 2022.
  • Wang et al. [2023] Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Ziwei Liu. Sparsenerf: Distilling depth ranking for few-shot novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9065–9076, 2023.
  • Wang et al. [2021] Xiang Wang, Chen Wang, Bing Liu, Xiaoqing Zhou, Liang Zhang, Jin Zheng, and Xiao Bai. Multi-view stereo in the deep learning era: A comprehensive review. Displays, 70:102102, 2021.
  • Wang et al. [2024a] Xiang Wang, Haonan Luo, Zihang Wang, Jin Zheng, and Xiao Bai. Robust training for multi-view stereo networks with noisy labels. Displays, 81:102604, 2024a.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Wang et al. [2024b] Zihang Wang, Haonan Luo, Xiang Wang, Jin Zheng, Xin Ning, and Xiao Bai. A contrastive learning based unsupervised multi-view stereo with multi-stage self-training strategy. Displays, page 102672, 2024b.
  • Wu et al. [2023] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
  • Wu et al. [2022] Zijin Wu, Xingyi Li, Juewen Peng, Hao Lu, Zhiguo Cao, and Weicai Zhong. Dof-nerf: Depth-of-field meets neural radiance fields. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1718–1729, 2022.
  • Wynn and Turmukhambetov [2023] Jamie Wynn and Daniyar Turmukhambetov. Diffusionerf: Regularizing neural radiance fields with denoising diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4180–4189, 2023.
  • Xiong et al. [2023] Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, and Achuta Kadambi. Sparsegs: Real-time 360° sparse view synthesis using gaussian splatting. arXiv preprint arXiv:2312.00206, 2023.
  • Xu et al. [2023] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360deg views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4479–4489, 2023.
  • Xu et al. [2022] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5438–5448, 2022.
  • Yang et al. [2023] Jiawei Yang, Marco Pavone, and Yue Wang. Freenerf: Improving few-shot neural rendering with free frequency regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8254–8263, 2023.
  • Yu et al. [2021a] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5752–5761, 2021a.
  • Yu et al. [2021b] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021b.
  • Yu et al. [2022] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in neural information processing systems, 35:25018–25032, 2022.
  • Zhang et al. [2022a] Chi Zhang, Wei Yin, Billzb Wang, Gang Yu, Bin Fu, and Chunhua Shen. Hierarchical normalization for robust monocular depth estimation. Advances in Neural Information Processing Systems, 35:14128–14139, 2022a.
  • Zhang et al. [2022b] Jiawei Zhang, Xiang Wang, Xiao Bai, Chen Wang, Lei Huang, Yimin Chen, Lin Gu, Jun Zhou, Tatsuya Harada, and Edwin R Hancock. Revisiting domain generalized stereo matching networks from a feature consistency perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13001–13011, 2022b.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • Zhou et al. [2016] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 286–301. Springer, 2016.
  • Zhou et al. [2023] Xiaoqing Zhou, Xiang Wang, Jin Zheng, and Xiao Bai. Adaptive spatial sparsification for efficient multi-view stereo matching. Acta Electronica Sinica, 51(11):3079–3091, 2023.
  • Zhou and Tulsiani [2023] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023.
  • Zhu et al. [2023] Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang. Fsgs: Real-time few-shot view synthesis using gaussian splatting. arXiv preprint arXiv:2312.00451, 2023.