DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields
with Global-Local Depth Normalization

Jiahe Li$^{1}$, Jiawei Zhang$^{1}$, Xiao Bai$^{1}$, ** Zheng$^{1}$, Xin Ning$^{2}$, Jun Zhou$^{3}$, Lin Gu$^{4,5}$

$^{1}$ School of Computer Science and Engineering, State Key Laboratory of Complex & Critical Software Environment, Jiangxi Research Institute, Beihang University

$^{2}$ Institute of Semiconductors, Chinese Academy of Sciences

$^{3}$ School of Information and Communication Technology, Griffith University

$^{4}$ RIKEN AIP

$^{5}$ The University of Tokyo

Corresponding author: Xiao Bai ([email protected]).

Abstract

Radiance fields have demonstrated impressive performance in synthesizing novel views from sparse input views, yet prevailing methods suffer from high training costs and slow inference speed. This paper introduces DNGaussian, a depth-regularized framework based on 3D Gaussian radiance fields, offering real-time and high-quality few-shot novel view synthesis at low costs. Our motivation stems from the highly efficient representation and surprising quality of the recent 3D Gaussian Splatting, even though it suffers from geometry degradation when the number of input views decreases. In Gaussian radiance fields, we find that this degradation in scene geometry is primarily linked to the positioning of Gaussian primitives and can be mitigated by depth constraints. Consequently, we propose a Hard and Soft Depth Regularization to restore accurate scene geometry under coarse monocular depth supervision while maintaining a fine-grained color appearance. To further refine detailed geometry reshaping, we introduce Global-Local Depth Normalization, enhancing the focus on small local depth changes. Extensive experiments on LLFF, DTU, and Blender datasets demonstrate that DNGaussian outperforms state-of-the-art methods, achieving comparable or better results with significantly reduced memory cost, a $25\times$ reduction in training time, and over $3000\times$ faster rendering speed. Code is available at: https://github.com/Fictionarry/DNGaussian.

Figure 1: Comparison of the state-of-the-art methods FreeNeRF [53] and SparseNeRF [42] with our DNGaussian, using three views for training. DNGaussian stands out by delivering comparably high-quality synthesized views and superior details with a remarkable 25× reduction in training time and significantly lower memory overhead, while attaining the fastest and the only real-time rendering speed of 300 FPS. The point cloud of Gaussians illustrates the detailed and explainable spatial representation learned through our method.


1 Introduction

Novel view synthesis with sparse inputs poses a challenge for radiance fields. Recent advances in neural radiance fields (NeRF) have excelled in reconstructing photorealistic appearance and accurate geometry from just a handful of input views [27, 53, 42, 35, 49, 36, 5, 55, 16]. However, most sparse-view NeRFs are implemented with low processing speed and substantial memory consumption, resulting in high time and computational costs that restrict their practical applications. While some methods [36, 38, 49] achieve faster inference speed with grid-based backbones [26, 37, 13], they often suffer from trade-offs, leading to either high training costs or compromised rendering quality.

Recently, 3D Gaussian Splatting [18] has introduced an unstructured 3D Gaussian radiance field, employing a set of 3D Gaussian primitives to achieve remarkable success in rapid, high-quality, and low-cost novel view synthesis when learned from dense color input views. Even with only sparse inputs, it can still partially retain the surprising ability to reconstruct some clear and detailed local features. Nevertheless, the decrease in view constraints causes a significant portion of the scene geometry to be learned incorrectly, resulting in failures in novel view synthesis, as illustrated in Figure 2. Inspired by the success of earlier depth-regularized sparse-view NeRFs [42, 36], this paper explores distilling depth information from pre-trained monocular depth estimators to rectify the ill-learned geometry of the Gaussian fields, and introduces the Depth Normalization Regularized Sparse-view 3D Gaussian Radiance Fields (DNGaussian) to pursue higher quality and efficiency for few-shot novel view synthesis.

Figure 2: 3D Gaussian Splatting [18] exhibits its potential to reconstruct some fine details (green box) from sparse input views. Nevertheless, the reduced input views would significantly degrade geometry and cause failed reconstruction (orange box). After applying depth regularization, DNGaussian successfully recovers accurate geometry and synthesizes high-quality novel views.

Despite sharing a similar form of depth rendering, the depth regularization for 3D Gaussian radiance fields differs significantly from that employed by NeRF. Firstly, existing depth regularization strategies for NeRFs commonly employ depth to regularize the entire model, which creates a potential geometry conflict in the Gaussian fields that adversely affects quality. Specifically, this practice forces the shape of Gaussians to fit the smooth monocular depth rather than the complex color appearance, and thus results in loss of details and blurred appearance. Considering that the basis of scene geometry lies in the position of the Gaussian primitives rather than their shape, we freeze the shape parameters and propose a Hard and Soft Depth Regularization to enable spatial reshaping by encouraging movement among the primitives. During regularization, we propose rendering two types of depth to independently adjust the center and opacity of Gaussians without changing their shape, thereby striking a balance between fitting the complex color appearance and fitting the smooth, coarse depth.

Moreover, Gaussian radiance fields are more sensitive to small depth errors than NeRF, which may result in a noisy distribution of primitives and failures in regions with complex textures. Existing scale-invariant depth losses often align depth maps to a fixed scale, which causes small losses to be overlooked. To address this issue, we introduce the Global-Local Depth Normalization into the depth loss function, thus encouraging the learning of small local depth changes in a scale-invariant way. With the local and global scale normalization, our method guides the loss function to refocus on small local errors while maintaining knowledge of the absolute scale, enhancing the detailed geometry reshaping process for depth regularization.

Integrating the two proposed techniques, DNGaussian synthesizes views with competitive quality and superior details compared to state-of-the-art methods in multiple sparse-view settings on LLFF, Blender, and DTU datasets. This advantage is further enriched by substantially lower memory costs, $25\times$ reduction of training time, and over $3000\times$ faster rendering speed. The experiments also demonstrate our method’s universal ability to fit complex scenes, wide-ranging views, and multiple materials.

Our main contributions are the following:

• A Hard and Soft Depth Regularization to constrain the geometry of 3D Gaussian radiance fields by encouraging the movement of Gaussians, which enables the coarse-depth regularized space reshaping without compromising fine-grained color performance.

• A Global-Local Depth Normalization that normalizes depth patches on local scales to achieve a refocus on small local depth changes, thereby improving the reconstruction of detail appearance for 3D Gaussian radiance fields.

• A DNGaussian framework for fast and high-quality few-shot novel view synthesis, which combines the above two techniques and achieves competitive quality across multiple benchmarks compared to the state-of-the-art methods, excelling in capturing details with significantly lower training costs and real-time rendering.

To the best of our knowledge, this is the first attempt to analyze and address the depth regularization problem for 3D Gaussian Splatting under coarse depth cues. We hope this paper can inspire more ideas for optimizing radiance fields in under-constrained situations.

2 Related Work

Radiance Fields for Novel View Synthesis. Novel view synthesis aims to generate unseen views of the same object or scene from a set of given images [60, 1]. Neural Radiance Fields (NeRF) [25] uses a large MLP to represent 3D scenes and renders via volume rendering. However, it is slow in both training and inference. Subsequent improvements mainly pursue either higher quality [2, 3] or higher efficiency [6, 26, 37, 12, 54, 21, 15], but can hardly achieve both. The most recent unstructured radiance fields [7, 52, 18] utilize a set of primitives to represent scenes. Among them, 3D Gaussian Splatting [18] represents radiance fields by a set of anisotropic 3D Gaussians and renders with differentiable splatting. This approach achieves great success in fast and high-quality reconstruction of complex real scenes. While this method excels with dense input views and has achieved success in various 3D tasks [23, 47, 39], its reconstruction from sparse view inputs remains an open problem. Issues such as how to apply additional constraints for improvement are also still unsolved and worthy of discussion.

Few-shot Novel View Synthesis. Few-shot novel view synthesis aims to generate novel views from only a set of sparse input views. Many works address the problem by introducing regularization strategies designed for NeRF [53, 27, 19, 11]. Some pre-trained methods aim to design a generative model and train it on large datasets [5, 55, 9, 62, 20], while others [49, 16] take pre-trained models as a type of loss to regularize the training process with well-learned knowledge. Depth distilling [11, 31, 36, 42] is also a powerful technique for sparse-view neural fields. However, limited by their powerful but slow backbones or by complex pre-trained models, most of these methods are costly in both training and inference. Although some methods [36, 38, 49] have improved inference efficiency via grid-based backbones, they also suffer from trade-offs such as higher training costs or lower quality. More recently, some works [22, 32, 28] enable zero-shot novel view synthesis from even a single input using diffusion model priors, but they can hardly handle complex scenes and have lower efficiency.

Depth Supervision in Sparse-view Neural Fields. As a classic cue in many 3D vision tasks [46, 44, 41, 58, 43, 61], depth information has been widely used to supervise sparse-view neural fields. The first group of methods [11, 31] extracts accurate but sparse depth values from reliable point clouds, and the second [56, 14, 36, 40, 42] distills depth knowledge from powerful monocular depth estimators [30, 29]. Considering that point clouds are sparse and not available in many sparse-view cases, monocular depth shows its advantage in density and robustness for our tasks. To tackle the scale ambiguity and error of monocular depths, some previous works and concurrent sparse-view 3DGS methods have introduced various scale-invariant losses [56, 10, 36, 50, 63], including depth ranking losses [51, 42]. However, none of these is optimal for our setting. Firstly, flexible Gaussians are more sensitive to wrong depth cues, requiring extra designs for regularization. Also, these losses align the depth to a certain fixed global scale, which may ignore minor local depth changes. This oversight can lead to a noisy primitive distribution, particularly in regions with intricate textures. Besides, we note that an HDN loss [57] can preserve details in monocular depth estimation. Nevertheless, it is also unsuitable here, as its reliance on multi-scale patches would introduce long-distance errors and compromise geometric accuracy.

3 Method

3.1 Preliminary for 3D Gaussian Splatting

Representation. 3D Gaussian splatting [18] represents 3D information with a set of 3D Gaussians. It computes pixel-wise color $\mathcal{C}$ with a set of 3D Gaussian primitives $\theta$ , view pose $P$ , and the camera parameter involving the center $o$ .

Specifically, a Gaussian primitive can be described with a center $\mu\in\mathbb{R}^{3}$ , a scaling factor $s\in\mathbb{R}^{3}$ , and a rotation quaternion $q\in\mathbb{R}^{4}$ . The basis function of the $i$ -th primitive $\mathcal{G}_{i}$ is in the form of:

$$\mathcal{G}_{i}(x)=e^{-\frac{1}{2}(x-\mu_{i})^{T}\Sigma_{i}^{-1}(x-\mu_{i})}, \tag{1}$$

where the covariance matrix $\Sigma$ can be calculated from the scale $s$ and rotation $q$ . For rendering purposes, the Gaussian primitive also retains an opacity value $\alpha\in\mathbb{R}$ and a $K$ -dimensional color feature $f\in\mathbb{R}^{K}$ . Then $\theta_{i}=\{\mu_{i},s_{i},q_{i},\alpha_{i},f_{i}\}$ are the parameters of the $i$ -th Gaussian.
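As a concrete illustration of how $\Sigma$ can be obtained from $s$ and $q$, the following minimal NumPy sketch uses the factorization $\Sigma=RSS^{T}R^{T}$ from 3D Gaussian Splatting, where $R$ is the rotation matrix of $q$ and $S=\mathrm{diag}(s)$. The function names, the $(w,x,y,z)$ quaternion order, and the treatment of $s$ as already-activated per-axis scales are assumptions made for this example, not details taken from the released code.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion in (w, x, y, z) order (assumed) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a valid rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(s, q):
    """Sigma = R S S^T R^T with S = diag(s)."""
    M = quat_to_rotmat(q) @ np.diag(s)
    return M @ M.T

def gaussian_basis(x, mu, s, q):
    """Evaluate the basis function of Eq. (1) at a 3D point x."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(covariance(s, q), d)))

# Hypothetical primitive: axis-aligned, centered at the origin.
mu = np.array([0.0, 0.0, 0.0])
s = np.array([1.0, 2.0, 0.5])
q = np.array([1.0, 0.0, 0.0, 0.0])  # identity rotation

value_at_center = gaussian_basis(mu, mu, s, q)
value_off_center = gaussian_basis(np.array([1.0, 0.0, 0.0]), mu, s, q)
```

The basis function peaks at the center and decays with the Mahalanobis distance, so moving one scale unit along the first axis gives $e^{-1/2}$.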

Rendering. 3D Gaussian Splatting utilizes point-based rendering to compute the color $\mathcal{C}$ of pixel $x_{p}$ by blending $N$ ordered Gaussians overlapping the pixel:

$$\mathcal{C}(x_{p})=\sum_{i\in N}{c_{i}\widetilde{\alpha}_{i}\prod_{j=1}^{i-1}(1-\widetilde{\alpha}_{j})}, \tag{2}$$

where $c_{i}$ is the color decoded from the feature $f_{i}$ .

Different from NeRF’s ray sampling strategy, the $N$ Gaussians involved are gathered by a well-optimized rasterizer according to $x_{p}$ , the camera parameters, the view pose $P$ , and a set of pre-defined rules. The rendering opacity $\widetilde{\alpha}$ of each of the $N$ primitives is calculated from its opacity $\alpha$ and its projected 2D Gaussian $\mathcal{G}^{proj}$ on the image plane:

$$\widetilde{\alpha}_{i}=\alpha_{i}\mathcal{G}^{proj}_{i}(x_{p}). \tag{3}$$

Then, similar to NeRF, we can represent the pixel-wise depth $\mathcal{D}$ with the distance to the camera center $o$ :

$$\mathcal{D}(x_{p})=\sum_{i\in N}{||\mu_{i}-o||_{2}}\times\widetilde{\alpha}_{i}\prod_{j=1}^{i-1}(1-\widetilde{\alpha}_{j}). \tag{4}$$
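The blending in Eqs. (2) to (4) can be sketched for a single pixel as follows. This is a simplified NumPy illustration, not the tile-based CUDA rasterizer: it assumes the $N$ Gaussians are already sorted front to back and that the projected 2D Gaussian values $\mathcal{G}^{proj}_{i}(x_{p})$ are given. The name `blend_pixel` and the example values are hypothetical.

```python
import numpy as np

def blend_pixel(colors, dists, alpha, g_proj):
    """Blend N depth-sorted Gaussians at one pixel.

    colors: (N, 3) decoded colors c_i
    dists:  (N,)   distances ||mu_i - o||_2 to the camera center
    alpha:  (N,)   per-primitive opacities
    g_proj: (N,)   projected 2D Gaussian values at the pixel
    """
    a = alpha * g_proj                                     # Eq. (3)
    # Transmittance before primitive i: prod_{j<i} (1 - a_j), equal to 1 for the first.
    T = np.concatenate(([1.0], np.cumprod(1.0 - a)[:-1]))
    w = a * T                                              # blending weights
    color = (w[:, None] * colors).sum(axis=0)              # Eq. (2)
    depth = float((w * dists).sum())                       # Eq. (4)
    return color, depth, w

# Two hypothetical Gaussians: a half-transparent red one in front of an opaque blue one.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
dists = np.array([2.0, 4.0])
alpha = np.array([0.5, 1.0])
g_proj = np.array([1.0, 1.0])

color, depth, weights = blend_pixel(colors, dists, alpha, g_proj)
```

Color and depth share the same blending weights, which is why a depth loss can influence every parameter that affects those weights.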

Optimization. 3D Gaussian Splatting optimizes the parameters $\theta$ for all Gaussians through gradient descent under color supervision. During the optimization process, it identifies and duplicates the most active primitives to better represent intricate textures, while simultaneously removing redundant primitives. In this work, we inherit these optimization strategies for color supervision.

Initialization. To start from a better geometry, the method suggests utilizing the point cloud from COLMAP [34, 33] or other SfMs to initialize the Gaussians. Instead, considering the instability of point clouds in sparse-view situations, we initialize our method with a random set of Gaussians.

3.2 Depth Regularization for Gaussians

Despite sharing a similar depth computation, existing depth regularization for NeRFs cannot be transferred directly to 3D Gaussian radiance fields because the two representations differ substantially. First, the additional parameters of Gaussians introduce a conflict between the color and depth targets. Second, previous regularization for the continuous NeRF focuses only on density, so it can hardly work well on discrete and flexible Gaussian primitives.

Shape Freezing. 3D Gaussian radiance fields possess four optimizable parameters $\{\mu,s,q,\alpha\}$ that can directly influence the depth, which makes them more complex than NeRF. Since the mono-depth is much smoother and easier to fit than color, applying an all-parameter depth regularization to the entire model, as is widely done in previous sparse-view neural fields [56, 42, 14, 10, 51], would lead the shape parameters to overfit the target depth map and cause a blurry appearance. Thus, these parameters must be treated differently. Since the scene geometry is mainly represented by the position distribution of Gaussian primitives, we regard the center $\mu$ and opacity $\alpha$ as the most important parameters to regularize, for they respectively stand for the position itself and the occupancy of a position. Furthermore, to reduce the negative influence on color reconstruction, we freeze the scaling $s$ and rotation $q$ in the depth regularization.

Hard Depth Regularization. To achieve the spatial reshaping of the Gaussian fields, we first propose a Hard Depth Regularization that encourages the movement of the nearest Gaussians, which are expected to compose surfaces but often cause noise and artifacts. Considering the predicted depth is rendered with the mixture of multiple Gaussians and reweighted by the cumulative product $\widetilde{\alpha}$ , we manually apply a large opacity value $\tau$ to all Gaussians. Then, we render a “hard depth” that mainly consists of the nearest Gaussians on the ray shot from camera center $o$ and across the pixel $x_{p}$ :

$$\mathcal{D}_{hard}(x_{p})=\sum_{i\in N}{\tau(1-\tau)^{i-1}\mathcal{G}^{proj}_{i}(x_{p})||\mu_{i}-o||_{2}}. \tag{5}$$

Since only the center $\mu$ is now being optimized, Gaussians at wrong positions cannot escape the regularization by decreasing their opacity or changing their shape, so their centers $\mu$ must move. The regularization is implemented by a similarity loss on the target image area $\mathcal{P}$ that encourages the hard depth $\mathcal{D}_{hard}$ to be close to the monocular depth $\widetilde{\mathcal{D}}$ :

$$\mathcal{R}_{hard}(\mathcal{P})=\mathcal{L}_{similar}(\mathcal{D}_{hard}(\mathcal{P}),\widetilde{\mathcal{D}}(\mathcal{P})). \tag{6}$$

Soft Depth Regularization. Regularizing only the “hard depth” is insufficient because opacity is not optimized. We also need to ensure the accuracy of the actually rendered “soft depth”; otherwise, the surface may become semitransparent and appear hollow. From this perspective, we additionally freeze the Gaussian center $\mu$ (denoted by $\check{\mu}$ ) to avoid the negative influence caused by center movement, and propose a Soft Depth Regularization for tuning the opacity $\alpha$ :

$$\begin{aligned}
\mathcal{D}_{soft}(x_{p})&=\sum_{i\in N}{||\check{\mu}_{i}-o||_{2}}\times\widetilde{\alpha}_{i}\prod_{j=1}^{i-1}(1-\widetilde{\alpha}_{j}),\\
\mathcal{R}_{soft}(\mathcal{P})&=\mathcal{L}_{similar}(\mathcal{D}_{soft}(\mathcal{P}),\widetilde{\mathcal{D}}(\mathcal{P})).
\end{aligned} \tag{7}$$

With both the Hard and Soft Depth Regularization, we constrain the nearest Gaussians to stay in a suitable position with high opacity, therefore composing complete surfaces.
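To make the difference between the two rendered depths concrete, the sketch below evaluates Eq. (5) and the soft depth of Eq. (7) on one ray. It is a NumPy illustration only: gradient freezing (which parameters receive updates) is not modeled, since NumPy has no automatic differentiation, and the function names and example values are hypothetical.

```python
import numpy as np

def soft_depth(dists, alpha, g_proj):
    """Soft depth of Eq. (7): the ordinary alpha-blended depth."""
    a = alpha * g_proj
    T = np.concatenate(([1.0], np.cumprod(1.0 - a)[:-1]))
    return float(np.sum(dists * a * T))

def hard_depth(dists, g_proj, tau=0.95):
    """Hard depth of Eq. (5): learned opacities are replaced by a fixed large tau."""
    i = np.arange(len(dists))            # zero-based index, so (1 - tau)**i equals (1 - tau)^(i-1)
    w = tau * (1.0 - tau) ** i * g_proj
    return float(np.sum(w * dists))

# A nearly transparent Gaussian at distance 2 in front of a solid one at distance 5.
dists = np.array([2.0, 5.0])
alpha = np.array([0.1, 0.9])
g_proj = np.array([1.0, 1.0])

d_soft = soft_depth(dists, alpha, g_proj)  # 0.1 * 2 + 0.9 * 0.9 * 5 = 4.25
d_hard = hard_depth(dists, g_proj)         # 0.95 * 2 + 0.95 * 0.05 * 5 = 2.1375
```

With the learned opacities, the faint front Gaussian barely affects the soft depth, whereas the hard depth is dominated by it. A front Gaussian at the wrong position therefore produces a large hard-depth error that, with opacity and shape excluded, can only be reduced by moving its center.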

3.3 Global-Local Depth Normalization

Previous depth-supervised neural fields usually build the depth loss on the source scales of the depth maps [56, 36, 10, 14, 42]. This type of alignment measures all losses via a fixed scale based on the statistics of a large area. As a result, small errors might be overlooked, particularly when dealing with multiple objectives such as color reconstruction or a wide range of depth variance. This may not matter much in previous NeRF-based works, but can cause more serious problems in Gaussian radiance fields.

In Gaussian radiance fields, correcting small depth errors is more challenging because it relies primarily on the movement of Gaussian primitives, a process that proceeds with a small learning rate. Moreover, primitives whose positions are not corrected during depth regularization become floating noise and cause failures, especially in regions with detailed appearance where numerous primitives gather, as shown in Figure 4.

Local Depth Normalization. To solve this problem, we make the loss function refocus on small errors by introducing a patch-wise local normalization. Specifically, we cut the whole depth map into small patches and normalize each patch $\mathcal{P}$ of both the predicted depth and the monocular depth to a mean of $0$ and a standard deviation close to $1$ :

$$\mathcal{D}^{LN}(x)=\frac{\mathcal{D}(x)-\text{mean}(\mathcal{D}(\mathcal{P}))}{\text{std}(\mathcal{D}(\mathcal{P}))+\epsilon},\quad\mathrm{s.t.}\ \ x\in\mathcal{P}, \tag{8}$$

where $\epsilon$ is a small value for numerical stability. After this step, all patches are normalized on a local scale, and the loss is computed within each patch. We apply the Local Depth Normalization to both the Hard and Soft Depth Regularization to help with geometry reshaping.
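Eq. (8) can be sketched in NumPy as below. The non-overlapping tiling, the patch size of 4, and the value of $\epsilon$ are assumptions chosen for illustration; the text above only specifies that the depth map is cut into small patches.

```python
import numpy as np

def local_normalize(depth, patch=4, eps=1e-6):
    """Apply Eq. (8) independently to each non-overlapping patch x patch tile."""
    H, W = depth.shape
    assert H % patch == 0 and W % patch == 0, "sketch assumes the image tiles evenly"
    # tiles[i, a, j, b] == depth[i * patch + a, j * patch + b]
    tiles = depth.reshape(H // patch, patch, W // patch, patch)
    mean = tiles.mean(axis=(1, 3), keepdims=True)   # per-patch mean
    std = tiles.std(axis=(1, 3), keepdims=True)     # per-patch standard deviation
    return ((tiles - mean) / (std + eps)).reshape(H, W)

# Hypothetical depth map and the same map under an unknown scale and shift.
depth = np.arange(64, dtype=float).reshape(8, 8)
ln = local_normalize(depth)
ln_rescaled = local_normalize(10.0 * depth + 3.0)
```

Because every patch is shifted to zero mean and divided by its own standard deviation, the result is essentially unchanged by an affine rescaling of the depth map, and a small depth variation inside a flat patch is amplified to the same magnitude as a large variation elsewhere.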

Global Depth Normalization. Besides focusing on small local losses, we also need a global view to learn the overall shape. To compensate for the missing global scale, we further add a Global Depth Normalization to the depth regularization. This makes the depth loss aware of the global scale while preserving local relevance. Similar to the local one, we apply a patch-wise normalization to free the depth from the source scale and focus on local changes. The only difference is that here we use the global standard deviation of the whole image depth $\mathcal{D}_{\mathcal{I}}$ of image $\mathcal{I}$ in place of that of the patch:

$$\mathcal{D}^{GN}(x)=\frac{\mathcal{D}(x)-\text{mean}(\mathcal{D}(\mathcal{P}))}{\text{std}(\mathcal{D}_{\mathcal{I}})},\quad\mathrm{s.t.}\ \ x\in\mathcal{P},\ \mathcal{P}\subseteq\mathcal{I}. \tag{9}$$

In addition, our patch-wise normalization can also avoid long-distance errors in the monocular depth by driving the learning of locally relative depth, which serves a similar purpose to depth rank distillation [42, 51]. In contrast, for geometry reshaping purposes, we also encourage the model to learn the absolute depth change rather than ignoring it.
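A matching sketch of Eq. (9) follows, again under the assumption of non-overlapping tiles with a hypothetical patch size. Each patch is centered on its own mean, but all patches share the standard deviation of the whole image, so the relative size of depth changes across patches is preserved.

```python
import numpy as np

def global_normalize(depth, patch=4):
    """Apply Eq. (9): per-patch mean, whole-image standard deviation."""
    H, W = depth.shape
    assert H % patch == 0 and W % patch == 0, "sketch assumes the image tiles evenly"
    tiles = depth.reshape(H // patch, patch, W // patch, patch)
    mean = tiles.mean(axis=(1, 3), keepdims=True)
    # Eq. (9) has no epsilon; a perfectly constant depth map would divide by zero here.
    return ((tiles - mean) / depth.std()).reshape(H, W)

# Hypothetical 4x8 depth map: a nearly flat left patch and a steep right patch.
depth = np.zeros((4, 8))
depth[:, :4] = 0.01 * np.arange(4)
depth[:, 4:] = 1.0 * np.arange(4)

gn = global_normalize(depth, patch=4)
left_std, right_std = gn[:, :4].std(), gn[:, 4:].std()
```

Unlike the local variant, which would map both patches to unit standard deviation, the globally normalized steep patch keeps a spread 100 times that of the flat patch, so the loss still sees which regions carry large absolute depth changes.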

3.4 Training Details

Loss Function. The loss function consists of three parts: color reconstruction loss $\mathcal{L}_{color}$ , hard depth regularization $\mathcal{R}_{hard}$ , and soft depth regularization $\mathcal{R}_{soft}$ . Following 3D Gaussian Splatting, the color reconstruction loss is a combination of an L1 reconstruction loss and a D-SSIM term between the rendered image $\hat{\mathcal{I}}$ and the ground truth $\mathcal{I}$ :

$$\mathcal{L}_{color}=\mathcal{L}_{1}(\hat{\mathcal{I}},\mathcal{I})+\lambda\mathcal{L}_{\mathrm{D-SSIM}}(\hat{\mathcal{I}},\mathcal{I}). \tag{10}$$

The depth regularizations $\mathcal{R}_{hard}$ and $\mathcal{R}_{soft}$ each include a local and a global term, derived from our two kinds of depth normalization. We use the L2 loss to measure similarity. Both regularizations take the form:

$$\mathcal{R}_{T}=\mathcal{L}_{2}(\mathcal{D}_{T}^{GN},\widetilde{\mathcal{D}}^{GN})+\gamma\mathcal{L}_{2}(\mathcal{D}_{T}^{LN},\widetilde{\mathcal{D}}^{LN}), \tag{11}$$

where $T$ stands for $hard$ or $soft$ . In practice, we reserve an error tolerance for the L2 loss to relax the constraint. The full loss function is formulated by:

$$\mathcal{L}=\mathcal{L}_{color}+\mathcal{R}_{hard}+\mathcal{R}_{soft}. \tag{12}$$
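The way Eqs. (11) and (12) combine the two normalizations can be sketched as follows. This is an illustrative NumPy version under several assumptions: non-overlapping tiles with a hypothetical patch size, mean squared error as the L2 loss, and no error tolerance (its exact form is not given in this section). Only $\gamma=0.1$ is taken from the implementation details; all names are made up for the example.

```python
import numpy as np

def normalize(depth, patch, mode, eps=1e-6):
    """Patch-wise normalization: mode='local' follows Eq. (8), mode='global' Eq. (9)."""
    H, W = depth.shape
    tiles = depth.reshape(H // patch, patch, W // patch, patch)
    centered = tiles - tiles.mean(axis=(1, 3), keepdims=True)
    if mode == "local":
        scale = tiles.std(axis=(1, 3), keepdims=True) + eps  # per-patch std
    else:
        scale = depth.std()                                  # whole-image std
    return (centered / scale).reshape(H, W)

def l2(a, b):
    """Mean squared error, used here as the L2 similarity (an assumption)."""
    return float(np.mean((a - b) ** 2))

def depth_regularizer(rendered, mono, patch=4, gamma=0.1):
    """Eq. (11): global term plus gamma times the local term."""
    global_term = l2(normalize(rendered, patch, "global"), normalize(mono, patch, "global"))
    local_term = l2(normalize(rendered, patch, "local"), normalize(mono, patch, "local"))
    return global_term + gamma * local_term

def total_loss(l_color, r_hard, r_soft):
    """Eq. (12): unweighted sum of the three parts."""
    return l_color + r_hard + r_soft

rendered = np.arange(64, dtype=float).reshape(8, 8)
r_same_shape = depth_regularizer(rendered, 3.0 * rendered + 7.0)  # mono depth differs by scale and shift
r_diff_shape = depth_regularizer(rendered, rendered ** 2)         # mono depth has a different shape
```

The regularizer is close to zero when the rendered depth matches the monocular depth up to an unknown scale and shift, and positive when the local depth structure differs, which is the behavior a scale-ambiguous monocular prior requires.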

Method	Setting	LLFF PSNR $\uparrow$	LLFF LPIPS $\downarrow$	LLFF SSIM $\uparrow$	LLFF AVGE $\downarrow$	DTU PSNR $\uparrow$	DTU LPIPS $\downarrow$	DTU SSIM $\uparrow$	DTU AVGE $\downarrow$
SRF [8]	Trained on DTU	12.34	0.591	0.250	0.313	15.32	0.304	0.671	0.171
PixelNeRF [55]	Trained on DTU	7.93	0.682	0.272	0.461	16.82	0.270	0.695	0.147
MVSNeRF [5]	Trained on DTU	17.25	0.356	0.557	0.171	18.63	0.197	0.769	0.113
SRF ft [8]	Trained on DTU, Fine-tuned per Scene	17.07	0.529	0.436	0.203	15.68	0.281	0.698	0.162
PixelNeRF ft [55]	Trained on DTU, Fine-tuned per Scene	16.17	0.512	0.438	0.217	18.95	0.269	0.710	0.125
MVSNeRF ft [5]	Trained on DTU, Fine-tuned per Scene	17.88	0.327	0.584	0.157	18.54	0.197	0.769	0.113
Mip-NeRF [2]	Optimized per Scene	14.62	0.495	0.351	0.246	8.68	0.353	0.571	0.323
DietNeRF [16]	Optimized per Scene	14.94	0.496	0.370	0.240	11.85	0.314	0.633	0.243
RegNeRF [27]	Optimized per Scene	19.08	0.336	0.587	0.149	18.89	0.190	0.745	0.112
FreeNeRF [53]	Optimized per Scene	19.63	0.308	0.612	0.134	19.92	0.182	0.787	0.098
SparseNeRF [42]	Optimized per Scene	19.86	0.328	0.624	0.127	19.55	0.201	0.769	0.102
3DGS [18]	Optimized per Scene	15.52	0.405	0.408	0.209	10.99	0.313	0.585	0.252
3DGS†	Optimized per Scene	16.46	0.401	0.440	0.192	14.74	0.249	0.672	0.169
DNGaussian (Ours)	Optimized per Scene	19.12	0.294	0.591	0.132	18.91	0.176	0.790	0.102
† With the same hyperparameters and neural color renderer as DNGaussian.

Table 1: Quantitative Comparison on LLFF and DTU for 3 input views. The best, second-best, and third-best entries are marked in red, orange, and yellow, respectively. Notably, the Gaussian-based methods directly show the background color on the meaningless invisible places, which would cause lower metrics, especially in PSNR.

Method	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$
NeRF [25]	14.934	0.687	0.318
NeRF (Simplified) [16]	20.092	0.822	0.179
DietNeRF [16]	23.147	0.866	0.109
DietNeRF + ft [16]	23.591	0.874	0.097
FreeNeRF [53]	24.259	0.883	0.098
SparseNeRF [42]	22.410	0.861	0.119
3DGS [18]	22.226	0.858	0.114
DNGaussian (Ours)	24.305	0.886	0.088

Table 2: Quantitative Comparison on Blender for 8 input views. The best, second-best, and third-best entries are marked in red, orange, and yellow, respectively.

Neural Color Renderer. 3D Gaussian Splatting stores color via spherical harmonics, which easily overfit when only sparse views are available. To relieve this problem, we use a grid encoder and an MLP as the Neural Color Renderer to predict the color of each primitive (Figure 3). During inference, we store the intermediate result and only compute the last MLP layers, which merge in the view direction, for acceleration.

4 Experiments

4.1 Setups

Datasets. We conduct our experiments on three datasets: the NeRF Blender Synthetic dataset (Blender) [25], the DTU dataset [17], and the LLFF dataset [24]. We follow the setting used in previous works [27, 53, 42] with the same split of DTU and LLFF to train the model on 3 views and test on another set of images. To remove background noise and focus on the target object, we apply the same object masks as previous works [27] for DTU at evaluation. For Blender, we follow DietNeRF [16] and FreeNeRF [53] to train with the same 8 views and test on 25 unseen images. In line with the baselines, downsampling rates of $8$ , $4$ , and $2$ are applied to LLFF, DTU, and Blender, respectively. Following previous sparse-view settings, the camera poses are assumed to be known via calibration or other means.

Evaluation Metrics. We report PSNR, SSIM [45], and LPIPS [59] scores to evaluate the reconstruction performance quantitatively. Also, an Average Error (AVGE) [27] is reported by the geometric mean of $\text{MSE}=10^{-\text{PSNR}/10}$ , $\sqrt{1-\text{SSIM}}$ , and LPIPS.
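As a worked example of the Average Error defined above, the following sketch computes the geometric mean of the three terms from reported scores. The function name is ours; applying it to dataset-level means is a simplification, since the metric may instead be computed per image or per scene and then averaged.

```python
import math

def average_error(psnr, ssim, lpips):
    """Geometric mean of MSE = 10^(-PSNR/10), sqrt(1 - SSIM), and LPIPS."""
    mse = 10.0 ** (-psnr / 10.0)
    return (mse * math.sqrt(1.0 - ssim) * lpips) ** (1.0 / 3.0)

# LLFF scores reported for DNGaussian in Table 1: PSNR 19.12, SSIM 0.591, LPIPS 0.294.
avge_llff = average_error(19.12, 0.591, 0.294)
```

For these inputs the result rounds to 0.132, which agrees with the AVGE entry in the same row of Table 1.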

Baselines. Following previous sparse-view neural fields [27, 16, 53, 42], we take the current SOTA methods SRF [8], PixelNeRF [55], MVSNeRF [5], Mip-NeRF [2], DietNeRF [16], RegNeRF [27], FreeNeRF [53], and SparseNeRF [42] as our baselines. For most NeRF-based methods, we directly report the best quantitative results from their published papers for comparison. The results of raw 3D Gaussian Splatting (3DGS) [18] are also reported.

Implementations. We build our models on the official PyTorch 3D Gaussian Splatting codebase. We train the model with $6,000$ iterations for all datasets, and the soft depth regularization is applied after $1,000$ iterations for stability. We set $\gamma=0.1,\tau=0.95$ in loss functions for all experiments. The neural renderer consists of a hash encoder [26] with $16$ levels in a resolution range of $16$ to $512$ and a max size of $2^{19}$ , and a $5$ layer MLP with the hidden dim of $64$ . We use DPT [29] to predict monocular depth maps for all input views. The models of 3DGS and DNGaussian are randomly initialized with a uniform distribution.

4.2 Comparison

LLFF. The quantitative results and visualizations on the LLFF dataset from 3 input views are reported in Table 1 and Figure 5. Notably, the NeRF-based baselines interpolate colors into areas invisible from the input views, while the discrete Gaussian radiance fields directly expose the black background in these empty spaces, so the 3DGS-based methods are inherently penalized in the reconstruction metrics by these meaningless invisible areas. Despite that, our approach still outperforms all baselines in the LPIPS score, and achieves PSNR, SSIM, and Average Error comparable to the best methods. From both the quantitative and qualitative results, we can see that our DNGaussian predicts finer details and more precise geometry. FreeNeRF tends to synthesize smooth views that lack high-frequency details, and its geometry is not as accurate as that of the depth-supervised SparseNeRF and our DNGaussian. Although regularized by the same depth maps, SparseNeRF is weaker in details and geometry completeness. DNGaussian also shows large improvements in both image and geometry quality compared to the well-tuned 3DGS.

DTU. The quantitative results on the DTU 3-view setting reported in Table 1 show that our method achieves the best LPIPS and SSIM, and the second-best Average Error. However, we obtain a lower PSNR, which is mainly due to scale variance and the noise occlusion coming from the textureless board and background in the scene. In the qualitative examples in Figure 6, it can be observed that our method learns a more correct and complete geometry than both FreeNeRF and the depth-supervised SparseNeRF, and produces high-quality details even on difficult plush and reflective areas.

Blender. To test the fitting ability from surrounding views, we evaluate on the Blender dataset with 8 input views. The scores are reported in Table 2, in which some data come from FreeNeRF [53] and DietNeRF [16]. Our method obtains the best scores in PSNR, SSIM, and LPIPS. From the qualitative results in Figure 7, it can be seen that our method synthesizes views with correct geometry and fewer floaters compared to the vanilla 3DGS, and performs better in detail than the second-best FreeNeRF. The results demonstrate that DNGaussian can handle not only forward-facing scenes like LLFF and DTU, but also the full reconstruction of complex objects with transparent and reflective materials.

Efficiency. We further conduct an efficiency study on the LLFF 3-view setting with RTX 3090 Ti GPUs to explore the performance of current SOTA baselines [42, 53] with limited GPU memories of 24GB/12GB, and training time of 1.0h/0.5h, as shown in Table 3. The top row of each group represents the default setting of the corresponding baseline, where the training time is measured by us with the same number of iterations on a single GPU. While both FreeNeRF and SparseNeRF perform worse under strict resource limitations, our method shows huge advantages in efficiency, which achieves remarkable accelerations of $25\times$ on training time and over $3000\times$ on FPS, while synthesizing competitive quality novel views. Given the necessity for per-scene optimization and rapid visualization, our high efficiency holds significant value for practical applications.
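The speedup figures quoted above can be reproduced from the default-setting rows of Table 3 with simple arithmetic. The short sketch below is one reading of those numbers (training time in minutes, FPS as reported); the variable names are ours.

```python
# Default-setting training times from Table 3, converted to minutes.
baseline_minutes = {"FreeNeRF": 2.3 * 60, "SparseNeRF": 1.5 * 60}
ours_minutes = 3.5

# Rendering speed from Table 3 (both baselines are listed at 9e-2 FPS).
baseline_fps = 9e-2
ours_fps = 300.0

train_speedup = {name: t / ours_minutes for name, t in baseline_minutes.items()}
fps_speedup = ours_fps / baseline_fps
```

Under this reading, training is roughly 26 times faster than SparseNeRF and roughly 39 times faster than FreeNeRF, and rendering is roughly 3300 times faster, consistent with the stated $25\times$ and over $3000\times$.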

Method	FPS	Time	GPU Mem	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$
FreeNeRF [53]	$9\times10^{-2}$	2.3 h	$4\times48$ GB	19.63	0.308	0.612
		2.3 h	24 GB	19.71	0.322	0.603
		1.0 h	24 GB	19.66	0.357	0.574
SparseNeRF [42]	$9\times10^{-2}$	1.5 h	32 GB	19.86	0.328	0.624
		1.5 h	12 GB	19.95	0.334	0.598
		0.5 h	12 GB	19.94	0.341	0.585
Ours	300	3.5 min	2 GB	19.12	0.294	0.591

Table 3: Efficiency Comparison with Limited Resources. Our method achieves efficient training and the fastest real-time rendering while synthesizing competitive high-quality novel views.

AP Reg.	Hard & Soft Reg.	Local Norm.	Global Norm.	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$
✓				18.14	0.354	0.538
	✓			17.90	0.351	0.525
✓			✓	18.31	0.339	0.552
✓		✓	✓	18.68	0.331	0.565
	✓		✓	18.55	0.324	0.562
	✓	✓	✓	19.12	0.294	0.591

Table 4: Ablation Study. We ablate our method on the LLFF 3-view setting. The results show the effect of our contributions.

4.3 Ablation Study

We ablate our method on the LLFF 3-view setting. The quantitative results are reported in Tables 4 and 5.

Depth Regularization. We compare our Hard and Soft Depth Regularization against a plain all-parameter (AP) L2 reconstruction loss term. To illustrate the effect of our two types of depth separately and exclude the influence of shape freezing, we further visualize a comparison with the shape-freezing-only setting in Figure 8. The results show that plain depth regularization cannot effectively reshape the scene geometry, which confirms the necessity of our method. Both the qualitative and quantitative results demonstrate its benefit for geometry quality and high-frequency details.

Global-Local Depth Normalization. From the results, we observe that adding only the global normalization already helps fitting, which is mainly due to the local patch-wise loss computation. After adding the local normalization, the rendering quality improves, especially in details. These improvements are much more pronounced when applied to our proposed regularization than to the all-parameter regularization, which is unsuitable for Gaussian radiance fields. The results are consistent with our design and show the effectiveness of our Global-Local Depth Normalization.

Parameter Freezing. To illustrate the effect of our parameter-freezing strategy, we also ablate the shape freezing in regularization and the center freezing in soft depth calculation. The results are shown in Table 5 and Figure 9. The visualization illustrates the depth-color conflict discussed in Sec. 3.2. Without center freezing, some primitives may move to unexpected places to compensate for the depth loss, which lowers the quality. By introducing the proposed parameter freezing, we alleviate these problems and achieve the best results.

Setting	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$	AVGE $\downarrow$
w/o Shape Freezing	17.96	0.363	0.547	0.160
w/o Center Freezing	18.87	0.300	0.584	0.140
All	19.12	0.294	0.591	0.132

Table 5: Ablation Study on Parameter Freezing. The results demonstrate the necessity of our parameter freezing strategy.
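
One way to picture the freezing strategy is as a per-loss mask over parameter groups. The sketch below is illustrative only: the group names and the `TRAINABLE` table are hypothetical, and in particular the assumption that the hard depth term updates only the centers while the soft depth term updates only the opacities is inferred from the description of shape freezing and center freezing, not taken from the released code.

```python
# Parameter groups of a Gaussian primitive (names are illustrative).
GROUPS = ("center", "opacity", "scale", "rotation", "color")

# Which groups each loss term may update. ASSUMPTION: shape (scale, rotation)
# is frozen for both depth terms, and the center is additionally frozen for
# the soft depth term; the exact sets are inferred, not taken from the code.
TRAINABLE = {
    "color": set(GROUPS),
    "hard_depth": {"center"},
    "soft_depth": {"opacity"},
}

def mask_gradients(grads, loss_name):
    """Zero the gradient of every group that `loss_name` must not update."""
    allowed = TRAINABLE[loss_name]
    return {g: (v if g in allowed else 0.0) for g, v in grads.items()}

raw = {g: 1.0 for g in GROUPS}  # dummy gradients standing in for a real loss
print(mask_gradients(raw, "soft_depth"))
```

In an autograd framework the same effect is usually obtained by detaching the frozen tensors before the depth map is rendered, so that the depth loss never reaches them.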

5 Conclusion

This paper presents the DNGaussian framework that introduces 3DGS into the few-shot novel view synthesis task by depth regularization. Due to the space limitation, we have put more discussions in the supplementary material.

Acknowledgements. This work is supported by the National Natural Science Foundation of China (62276016, 62372029). Lin Gu is supported by JST Moonshot R&D Grant Number JPMJMS2011, Japan.


Supplementary Material

Overview

In the supplemental document, we first report additional studies in Sec. A of our proposed depth normalization, neural color renderer, and the performance of previous methods on fast grid-based backbones. Then, we describe the details of our implementation and dataset settings in our experiment in Sec. B. Finally, we discuss the limitations and future work of our method in Sec. C.

Appendix A Additional Results

A.1 Ablation Study on Depth Normalization

To better illustrate the roles of our Local and Global Depth Normalization, we conduct an additional ablation study in which the L2 loss function is replaced with L1 to avoid its shrinking of small losses. The quantitative and visualization results are shown in Table 6 and Figure 10. In the comparison, we apply the Global and the Local normalization separately to illustrate the strengths and weaknesses of each: 1) Although the global one alone can support the model in learning the overall scene, it is weak in optimizing minor errors, as discussed in Sec. 3.3. 2) The local one cannot stand alone due to the lack of absolute scale, but provides rich information on local depth changes. 3) By combining both techniques, our Global-Local Depth Normalization obtains knowledge of both the global scale and small local errors and achieves the best results. Notably, since a different type of loss is used in this study, the scores differ from those reported in the main paper. Despite this, our method still performs well, particularly in LPIPS and SSIM, which demonstrates the robustness of our depth normalization.

Setting	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$	AVGE $\downarrow$
Only Global	18.32	0.309	0.579	0.144
Only Local	17.17	0.338	0.523	0.167
All	18.67	0.291	0.595	0.137

Table 6: Additional Ablation Study on Depth Normalization. By combining the two proposed depth normalizations, our Global-Local Depth Normalization achieves the best quality.
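
To make the contrast between the two normalizations concrete, the sketch below normalizes one depth patch in both ways. It is a simplified reading of Sec. 3.3 and not the released implementation: we assume the local variant divides by the patch's own standard deviation, the global variant divides by the standard deviation of the whole depth map, and both subtract the patch mean. The function names and the `eps` value are hypothetical.

```python
from statistics import fmean, pstdev

def local_normalize(patch, eps=1e-6):
    """Center a patch and scale by its own std: amplifies small local changes."""
    mu, sigma = fmean(patch), pstdev(patch)
    return [(d - mu) / (sigma + eps) for d in patch]

def global_normalize(patch, image, eps=1e-6):
    """Center a patch but scale by the whole-image std: keeps the global scale."""
    mu, sigma = fmean(patch), pstdev(image)
    return [(d - mu) / (sigma + eps) for d in patch]

# A flattened toy depth map: a nearly flat foreground and a distant background.
image = [1.00, 1.01, 1.02, 1.03, 9.0, 9.5, 10.0, 10.5]
patch = image[:4]  # only small depth variation inside this patch

print(local_normalize(patch))          # rescaled to roughly unit magnitude
print(global_normalize(patch, image))  # stays close to zero
```

The toy patch mirrors the behavior described above: under global normalization the small variation inside the patch nearly vanishes, while local normalization rescales it to order one but discards the absolute scale.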

A.2 Ablation Study on Neural Color Renderer

In this work, we replace the spherical harmonics (SH) of 3D Gaussian Splatting with a neural color renderer to represent the direction-variant color. To better illustrate the function of this module, we compare it to the original SH function with different degrees on the LLFF dataset with 3 training views. The results are in Table 7 and Figure 11. The SH function easily overfits in the sparse-view situation, which results in some strange colors when the view changes. This may be caused by the independence of each primitive, which leads to a lack of regional consistency. After introducing the neural color renderer, the problem is alleviated. By storing the intermediate result and only computing the last two MLP layers, we can also maintain a fast rendering speed competitive with SH.

Setting	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$	AVGE $\downarrow$	FPS
SH degree=2	17.06	0.333	0.549	0.167	340
SH degree=3	17.11	0.328	0.560	0.164	300
Neural Renderer	19.12	0.294	0.591	0.132	300

Table 7: Ablation Study on Neural Color Renderer. Our neural color renderer successfully improves the rendering quality while keeping an equally fast inference speed.
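
As context for the SH settings compared in Table 7, it helps to count the view-dependent color parameters each primitive carries. A real SH basis of maximum degree $\ell$ has $(\ell+1)^{2}$ functions, and one coefficient is stored per basis function and color channel. The helper below is only this arithmetic; its name is ours.

```python
def sh_color_params(degree, channels=3):
    """Number of SH color coefficients stored per Gaussian primitive."""
    if degree < 0:
        raise ValueError("SH degree must be non-negative")
    return (degree + 1) ** 2 * channels

for degree in (2, 3):
    print(degree, sh_color_params(degree))
```

With three channels this gives 27 coefficients per primitive at degree 2 and 48 at degree 3, each fitted independently per primitive, which may be one reason the lack of regional consistency noted above becomes visible with only 3 training views.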

Backbone	Strategy	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$	Time $\downarrow$	VM Cost $\downarrow$	FPS $\uparrow$
Mip-NeRF [2]	None	14.62	0.495	0.351	2.2h	$\geq$ 32 GB	0.09
	FreeNeRF	19.63	0.308	0.612	2.3h
	SparseNeRF	19.86	0.328	0.624	1.5h
Instant-NGP [26]	None	17.19	0.483	0.469	3.8min	3 GB	3
	FreeNeRF	15.30	0.516	0.369	4.2min
	SparseNeRF	17.19	0.478	0.476	7.5min
TensoRF [6]	None	16.16	0.454	0.443	4.1min	8 GB	5
	FreeNeRF	15.78	0.466	0.430	4.5min
	SparseNeRF	16.11	0.465	0.443	8.9min
3DGS [18]	None	16.46	0.401	0.440	2.7min	2 GB	280
	FreeNeRF	16.55	0.399	0.472	2.7min
	SparseNeRF	16.80	0.374	0.504	2.9min
3DGS [18]	Ours	19.12	0.294	0.591	3.5min	2 GB	300

Table 8: Comparison of SOTA strategies FreeNeRF [53] and SparseNeRF [42] with different backbones. The best results for all and for each backbone are marked with bold and underline. Although FreeNeRF and SparseNeRF perform well on the implicit Mip-NeRF [2], they only weakly improve the quality on current fast backbones Instant-NGP [26] and TensoRF [6], as well as on the 3DGS [18] used in our work.

A.3 Transfer of Previous Strategies

In this section, we conduct an experiment to illustrate the necessity of our efficient DNGaussian. Indeed, some existing methods like FreeNeRF [53] and SparseNeRF [42] are low in efficiency mainly due to their backbone rather than the strategy itself. However, they have only been proven effective on implicit backbones that are slow and costly. To verify whether they can be directly transferred to faster backbones to achieve higher efficiency, we implement these two methods on two fast grid-based backbones, Instant-NGP [26] and TensoRF [6], as well as on our 3D Gaussian Splatting (3DGS) [18] backbone. We then test these implementations in the LLFF 3-view setting. The results are shown in Table 8 and Figure 12. Additionally, we report the training time (Time), GPU memory cost (VM Cost), and the inference FPS for each item.

Implementation details. We utilize a CUDA-implemented ray marching (https://github.com/ashawkey/torch-ngp) for the two grid-based backbones to achieve faster speed and lower costs. The 3DGS backbone employs the same neural color renderer as our method. We follow the original implementation of SparseNeRF to produce monocular depth maps for all input views and transfer its Local Depth Ranking Distillation to these new backbones with the same hyperparameters. For FreeNeRF, since the three fast backbones do not contain a frequency-based positional encoding, we apply the Frequency Regularization to their grid-based positional encoding as an alternative.
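
For readers unfamiliar with the transferred depth strategy, the following is a minimal sketch of a depth-ranking hinge loss in the spirit of SparseNeRF's Local Depth Ranking Distillation. It assumes pixel pairs are drawn from a local patch, that a larger monocular value means farther away, and that only the ordering given by the monocular depth is enforced. The function name, the pair format, and the margin value are hypothetical and not taken from the SparseNeRF code.

```python
def depth_ranking_loss(rendered, mono, pairs, margin=1e-4):
    """Penalize rendered depths whose order contradicts the monocular depth.

    rendered, mono: sequences of per-pixel depth with the same indexing.
    pairs: iterable of (i, j) pixel-index pairs sampled within a patch.
    """
    total, count = 0.0, 0
    for i, j in pairs:
        # Order the pair so that `near` is closer according to the mono depth
        # (assumes larger mono value = farther).
        near, far = (i, j) if mono[i] <= mono[j] else (j, i)
        # Hinge: zero when the rendered depth agrees with that ordering.
        total += max(rendered[near] - rendered[far] + margin, 0.0)
        count += 1
    return total / max(count, 1)

mono = [0.2, 0.5, 0.9]
pairs = [(0, 1), (1, 2)]
print(depth_ranking_loss([1.0, 2.0, 3.0], mono, pairs))  # ordering agrees
print(depth_ranking_loss([3.0, 2.0, 1.0], mono, pairs))  # ordering reversed
```

Because such a loss constrains only the ordering and not the magnitude of depth, the fine geometry is still left to the color supervision.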

Comparison on grid-based backbones. Although these two methods perform well on their original implicit Mip-NeRF, they are weak on both Instant-NGP and TensoRF. SparseNeRF distills the depth ranking from the monocular depth map for regularization, but this causes more blur. This may be due to the stronger spatial memory of the explicit grids, which makes it easier to memorize noise. FreeNeRF performs even worse on both backbones, which may be due to the different representation of positional encoding. On TensoRF, both strategies fail to improve performance. One reason may be that TensoRF directly utilizes explicit grids without a neural decoder to store density values, which makes it more difficult to regularize.

Comparison on 3DGS. In the comparison, the 3DGS backbone achieves the best efficiency, with the highest FPS and lowest training cost. However, neither SparseNeRF nor FreeNeRF can effectively regularize this powerful and efficient backbone. Due to the lack of frequency positional encoding, FreeNeRF acts more like a coarse-to-fine strategy and leads to only a small improvement. From the visualization of SparseNeRF in Figure 12, it can be observed that, in the 3D Gaussian radiance fields of 3DGS, it is insufficient to only keep the depth ranking and wait for the color-supervised optimization process to refine the detailed geometry. Compared with these two methods, our DNGaussian achieves much better quality with only a small increase in training time. With less noise in the learned geometry, our method also achieves a faster inference speed.

Conclusion. The experiments show that previous methods designed for implicit backbones cannot be easily transferred to current fast backbones, and they are not well suited to 3D Gaussian radiance fields either. In such a situation, our DNGaussian shows significant value in providing an efficient way for high-quality and low-cost few-shot novel view synthesis.

A.4 Comparison with Grid-based Methods

There are some works [49, 38, 36] that utilize a grid-based backbone to improve the training efficiency. Since DaRF [36] is evaluated on two other datasets with at least 9 input views, while VGOS [38] and DiffusioNeRF [49] use different methods for measuring the metrics, we do not take them as baselines in the main paper. Here we list the scores of VGOS and DiffusioNeRF in Table 9 in the LLFF 3-view setting for comparison. For VGOS, we only report scores for which the measurement method is definitely the same as in RegNeRF [27] and FreeNeRF [53]. The results of DiffusioNeRF are obtained from its updated paper on arXiv (https://arxiv.longhoe.net/abs/2302.12231). In the comparison, our method outperforms the other two with the best scores in LPIPS, SSIM, and AVGE. Our method is also the most efficient, with much lower cost and faster inference.

A.5 Additional Visualizations

We provide more rendering results from our experiments. Examples on DTU and LLFF with 3 training views are shown in Figures 14 and 15. We also show more qualitative comparisons in the Blender 8-view setting with the SOTA method FreeNeRF [53] in Figure 13. More results can be found in our supplementary video.

Method	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$	AVGE $\downarrow$
VGOS [38]	19.35	0.432	-	-
DiffusioNeRF [49]	19.79	0.338	0.568	0.136
Ours	19.12	0.294	0.591	0.132

Table 9: Comparison with grid-based few-shot NeRFs on LLFF with 3 training views. Our method outperforms the grid-based methods VGOS [38] and DiffusioNeRF [49] in LPIPS, SSIM, and AVGE.

Appendix B Details

B.1 Implementations

Pre-trained Depth Models. In this work, we use the pre-trained DPT [29, 30] estimator to predict the depth map, which has been widely used in many NeRF-based works [10, 42, 4, 48]. Specifically, we use dpt_hybrid_384 for the LLFF dataset and dpt_large_384 for DTU and Blender, since the latter performs better on pure white or black backgrounds. In fact, the performance gaps of our method when applying different types of depth models are slight, as shown in Table 10.

LLFF
Type	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$	AVGE $\downarrow$
dpt_hybrid_384*	19.12	0.294	0.591	0.132
dpt_large_384	19.03	0.297	0.590	0.135
DTU
dpt_hybrid_384	18.86	0.179	0.784	0.106
dpt_large_384*	18.91	0.176	0.790	0.102

Table 10: The influence of different pre-trained depth models. We replace the pre-trained depth model with a different type in our LLFF and DTU settings while keeping the same hyperparameters. The results show that our method is robust to different monocular depth estimators. * denotes the default type of each dataset.

Patch Size. In our implementation, we randomly sample a patch size from a pre-defined range for our patch-based Global-Local Depth Normalization. This range is set to $[5,17]$ for LLFF and Blender, and to a larger $[17,51]$ for DTU, since the objects are smaller but occupy a large proportion of the image. Due to the flexibility of our normalization, we do not need to separately tune this value for each scene.
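
A minimal sketch of the patch-size sampling described above. The ranges come from the text; whether the bounds are inclusive, how the image is tiled, and how border remainders are handled are assumptions, and all names are hypothetical.

```python
import random

PATCH_RANGE = {"LLFF": (5, 17), "Blender": (5, 17), "DTU": (17, 51)}

def sample_patches(height, width, dataset, rng=random):
    """Pick one random patch size and tile the image into square windows.

    Returns (size, [(top, left, bottom, right), ...]). Windows at the border
    are clipped to the image, which is an assumption of this sketch.
    """
    lo, hi = PATCH_RANGE[dataset]
    size = rng.randint(lo, hi)  # inclusive bounds (assumed)
    windows = [
        (top, left, min(top + size, height), min(left + size, width))
        for top in range(0, height, size)
        for left in range(0, width, size)
    ]
    return size, windows

size, windows = sample_patches(378, 504, "LLFF", random.Random(0))
print(size, len(windows))
```

Resampling the size at each step exposes the depth loss to several local scales without per-scene tuning of a single fixed value.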

Metrics. Following previous methods [27, 53], we utilize the “structural_similarity” API from scikit-image (https://scikit-image.org/docs/stable/api/skimage.metrics.html) to compute the SSIM score, and use the implementation with a pre-trained VGG model to calculate the LPIPS score.
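
The tables also report AVGE, which is not defined in this section. Assuming it follows the average-error convention of earlier few-shot NeRF work [27], i.e. the geometric mean of $\mathrm{MSE}=10^{-\mathrm{PSNR}/10}$, $\sqrt{1-\mathrm{SSIM}}$, and LPIPS, it can be sketched as below; the function name is ours.

```python
import math

def avg_error(psnr, ssim, lpips):
    """Geometric mean of MSE (derived from PSNR), sqrt(1 - SSIM), and LPIPS."""
    mse = 10.0 ** (-psnr / 10.0)
    dssim = math.sqrt(1.0 - ssim)
    return (mse * dssim * lpips) ** (1.0 / 3.0)

# Dataset-level scores reported for our method on LLFF with 3 views.
print(round(avg_error(19.12, 0.591, 0.294), 3))
```

Applied to these averaged scores, the result agrees with the reported 0.132 to three decimals. If AVGE is computed per image or per scene before averaging, recomputing it from dataset-level means need not match exactly for every row.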

Camera Poses. Following existing works [27, 53, 42, 16], we assume all camera poses are already known. In practice, for LLFF and Blender, we use the given poses from the datasets. For DTU, we use COLMAP [34, 33] to calculate the camera poses according to all given views, and then sample the target sparse views from the results.

B.2 Datasets

LLFF. The LLFF dataset [24] contains 8 forward-facing scenes in total. Following [27, 53, 42], we take every 8-th image as the novel views for testing. The input views are evenly sampled across the remaining views. Images are downsampled $8\times$ to the resolution of $378\times 504$ . In practice, we ignore the distortion of the original images.
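
The split can be sketched as follows. The index conventions (held-out views start at index 0, training views are picked at evenly spaced positions with rounding) are assumptions in the style of the cited protocol, and the function name is hypothetical.

```python
def llff_split(num_images, num_train=3, hold=8):
    """Return (train_ids, test_ids) for a forward-facing scene."""
    test_ids = list(range(0, num_images, hold))  # every 8th image
    rest = [i for i in range(num_images) if i % hold != 0]
    if num_train > len(rest):
        raise ValueError("not enough remaining views")
    if num_train == 1:
        return [rest[len(rest) // 2]], test_ids
    # Evenly spaced positions over the remaining views, endpoints included.
    step = (len(rest) - 1) / (num_train - 1)
    train_ids = [rest[round(k * step)] for k in range(num_train)]
    return train_ids, test_ids

train_ids, test_ids = llff_split(20)
print(train_ids, test_ids)
```

Keeping the test views fixed and drawing the sparse inputs only from the remainder ensures that no evaluation image is ever seen during training.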

DTU. The DTU dataset [17] consists of 124 object-centric scenes captured by a set of fixed cameras. We follow [27, 53, 42] to evaluate models directly on the 15 scenes with the scan IDs of 8, 21, 30, 31, 34, 38, 40, 41, 45, 55, 63, 82, 103, 110, and 114. In each scan, the images with the following IDs of 25, 22, and 28 are used as the input views in our 3-view setting. The test set consists of images with IDs of 1, 2, 9, 10, 11, 12, 14, 15, 23, 24, 26, 27, 29, 30, 31, 32, 33, 34, 35, 41, 42, 43, 45, 46 and 47 for evaluation. The images are downsampled $4\times$ . In particular, we use the undistorted images from COLMAP to eliminate the negative impact of unerased lens distortion.

Blender. We follow the data split used in [16, 53] for the Blender dataset [11]. The 8 input views are selected from the training images, with IDs 26, 86, 2, 55, 75, 93, 16, 73, and 8. The 25 test views are sampled evenly from the testing images for evaluation. All images are downsampled $2\times$ to $400\times 400$ during the experiment.

Appendix C Discussions and Limitations

Our DNGaussian utilizes coarse monocular depth to regularize the scene geometry in situations with sparse input views, and achieves significant improvement in the appearance quality. However, our method still has the limitations discussed below. We hope these issues can be solved in future work.

More Input Views. Besides only 3 input views, we have also explored the performance when the number of input views increases to 6 and 9 on the LLFF dataset, as shown in Table 11. In the experiment, it can be observed that as the number of views increases, the performance of the baseline also improves. Our DNGaussian can still improve the quality of the synthesized novel view with 6 input views. However, it does not work well when the number of input views increases to 9, which is nearly enough to provide sufficient color constraints. This may be due to the errors in the depth map that negatively influence the optimization process. The next step of our work can lie in leveraging the uncertainty of the monocular depth to filter out unreliable supervision.

Views	Method	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$	AVGE $\downarrow$
3	3DGS	15.52	0.405	0.408	0.209
	3DGS†	16.46	0.401	0.440	0.192
	DNGaussian	19.12	0.294	0.591	0.132
6	3DGS	20.63	0.226	0.699	0.108
	3DGS†	21.09	0.229	0.699	0.103
	DNGaussian	22.18	0.198	0.755	0.088
9	3DGS	20.44	0.230	0.697	0.108
	3DGS†	23.21	0.176	0.785	0.076
	DNGaussian	23.17	0.180	0.788	0.077

Table 11: Comparison with 3, 6, and 9 input views on LLFF dataset. † denotes applied with the same hyperparameters and the neural color renderer as DNGaussian.

Solid Color Planes. The anisotropic shape of the Gaussian primitive makes it difficult to represent a solid color plane in a situation with sparse input views. First, the primitives are hard to constrain by both color and depth in the region of the plane, which may cause ray-like noise and hollows. Also, since they can freely move to other regions with similar colors, the densification operation can be activated more frequently and generate noise. We hope this can be solved by additional geometry priors.

Specular Regions. Although our method can handle some specular regions by relying on depth supervision, especially from our Local Depth Normalization, the inconsistent appearances in these regions are still challenging for 3DGS. Completely solving this problem may still require more specialized designs.

Hollows and Cracks. The splatting technique of our Gaussian Splatting [18] backbone directly merges existing primitives to render the pixel-level color without interpolation. However, since not every pixel can be overlapped by the projected primitives, the empty space between two Gaussian primitives would cause hollows and cracks when the camera pose changes. For example, some hollows can be seen at Scan 40 in Figure 14. In this work, we try to solve this problem by paying more attention to high-frequency details and therefore encouraging the densifying of primitives to fill these empty areas. In the future, we believe this problem can be fundamentally solved by the improvement of the representation itself.

References

Avidan and Shashua [1997] Shai Avidan and Amnon Shashua. Novel view synthesis in tensor space. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1034–1040. IEEE, 1997.
Barron et al. [2021] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
Barron et al. [2022] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
Bian et al. [2023] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4160–4169, 2023.
Chen et al. [2021] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021.
Chen et al. [2022] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII, pages 333–350. Springer, 2022.
Chen et al. [2023] Zhang Chen, Zhong Li, Liangchen Song, Lele Chen, Jingyi Yu, Junsong Yuan, and Yi Xu. Neurbf: A neural fields representation with adaptive radial basis functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4182–4194, 2023.
Chibane et al. [2021] Julian Chibane, Aayush Bansal, Verica Lazova, and Gerard Pons-Moll. Stereo radiance fields (srf): Learning view synthesis for sparse views of novel scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7911–7920, 2021.
Cong et al. [2023] Wenyan Cong, Hanxue Liang, Peihao Wang, Zhiwen Fan, Tianlong Chen, Mukund Varma, Yi Wang, and Zhangyang Wang. Enhancing nerf akin to enhancing llms: Generalizable nerf transformer with mixture-of-view-experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3193–3204, 2023.
Deng et al. [2023] Congyue Deng, Chiyu Jiang, Charles R Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov, et al. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20637–20647, 2023.
Deng et al. [2022] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
Fridovich-Keil et al. [2022] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
Fridovich-Keil et al. [2023] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.
Hu et al. [2023a] Shoukang Hu, Kaichen Zhou, Kaiyu Li, Longhui Yu, Lanqing Hong, Tianyang Hu, Zhenguo Li, Gim Hee Lee, and Ziwei Liu. Consistentnerf: Enhancing neural radiance fields with 3d consistency for sparse view synthesis. arXiv preprint arXiv:2305.11031, 2023a.
Hu et al. [2023b] Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, and Yuewen Ma. Tri-miprf: Tri-mip representation for efficient anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19774–19783, 2023b.
Jain et al. [2021] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting nerf on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5885–5894, 2021.
Jensen et al. [2014] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 406–413, 2014.
Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 42(4):1–14, 2023.
Kim et al. [2022] Mijeong Kim, Seonguk Seo, and Bohyung Han. Infonerf: Ray entropy minimization for few-shot neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12912–12921, 2022.
Kulhánek et al. [2022] Jonáš Kulhánek, Erik Derner, Torsten Sattler, and Robert Babuška. Viewformer: Nerf-free neural rendering from few images using transformers. In European Conference on Computer Vision (ECCV), 2022.
Liu et al. [2020] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
Liu et al. [2023] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.
Luiten et al. [2023] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
Mildenhall et al. [2019] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
Niemeyer et al. [2022] Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5480–5490, 2022.
Qian et al. [2023] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. ICCV, 2021.
Ranftl et al. [2022] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 2022.
Roessle et al. [2022] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, Pratul P Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12892–12901, 2022.
Sargent et al. [2023] Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, et al. Zeronvs: Zero-shot 360-degree view synthesis from a single real image. arXiv preprint arXiv:2310.17994, 2023.
Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Schönberger et al. [2016] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision (ECCV), 2016.
Seo et al. [2023] Seunghyeon Seo, Yeonjin Chang, and Nojun Kwak. Flipnerf: Flipped reflection rays for few-shot novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22883–22893, 2023.
Song et al. [2023] Jiuhn Song, Seonghoon Park, Honggyu An, Seokju Cho, Min-Seop Kwak, Sungjin Cho, and Seungryong Kim. Därf: Boosting radiance fields from sparse inputs with monocular depth adaptation, 2023.
Sun et al. [2022] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5459–5469, 2022.
Sun et al. [2023] Jiakai Sun, Zhanjie Zhang, Jiafu Chen, Guangyuan Li, Boyan Ji, Lei Zhao, and Wei Xing. Vgos: Voxel grid optimization for view synthesis from sparse inputs. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pages 1414–1422. International Joint Conferences on Artificial Intelligence Organization, 2023. Main Track.
Tang et al. [2023] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.
Uy et al. [2023] Mikaela Angelina Uy, Ricardo Martin-Brualla, Leonidas Guibas, and Ke Li. Scade: Nerfs from space carving with ambiguity-aware depth estimates. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16518–16527, 2023.
Wang et al. [2022] Chen Wang, Xiang Wang, Jiawei Zhang, Liang Zhang, Xiao Bai, Xin Ning, Jun Zhou, and Edwin Hancock. Uncertainty estimation for stereo matching based on evidential deep learning. Pattern Recognition, 124:108498, 2022.
Wang et al. [2023] Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Ziwei Liu. Sparsenerf: Distilling depth ranking for few-shot novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9065–9076, 2023.
Wang et al. [2021] Xiang Wang, Chen Wang, Bing Liu, Xiaoqing Zhou, Liang Zhang, Jin Zheng, and Xiao Bai. Multi-view stereo in the deep learning era: A comprehensive review. Displays, 70:102102, 2021.
Wang et al. [2024a] Xiang Wang, Haonan Luo, Zihang Wang, Jin Zheng, and Xiao Bai. Robust training for multi-view stereo networks with noisy labels. Displays, 81:102604, 2024a.
Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
Wang et al. [2024b] Zihang Wang, Haonan Luo, Xiang Wang, Jin Zheng, Xin Ning, and Xiao Bai. A contrastive learning based unsupervised multi-view stereo with multi-stage self-training strategy. Displays, page 102672, 2024b.
Wu et al. [2023] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
Wu et al. [2022] Zijin Wu, Xingyi Li, Juewen Peng, Hao Lu, Zhiguo Cao, and Weicai Zhong. Dof-nerf: Depth-of-field meets neural radiance fields. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1718–1729, 2022.
Wynn and Turmukhambetov [2023] Jamie Wynn and Daniyar Turmukhambetov. Diffusionerf: Regularizing neural radiance fields with denoising diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4180–4189, 2023.
Xiong et al. [2023] Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, and Achuta Kadambi. Sparsegs: Real-time 360° sparse view synthesis using gaussian splatting. arXiv preprint arXiv:2312.00206, 2023.
Xu et al. [2023] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360deg views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4479–4489, 2023.
Xu et al. [2022] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5438–5448, 2022.
Yang et al. [2023] Jiawei Yang, Marco Pavone, and Yue Wang. Freenerf: Improving few-shot neural rendering with free frequency regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8254–8263, 2023.
Yu et al. [2021a] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5752–5761, 2021a.
Yu et al. [2021b] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021b.
Yu et al. [2022] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in neural information processing systems, 35:25018–25032, 2022.
Zhang et al. [2022a] Chi Zhang, Wei Yin, Billzb Wang, Gang Yu, Bin Fu, and Chunhua Shen. Hierarchical normalization for robust monocular depth estimation. Advances in Neural Information Processing Systems, 35:14128–14139, 2022a.
Zhang et al. [2022b] Jiawei Zhang, Xiang Wang, Xiao Bai, Chen Wang, Lei Huang, Yimin Chen, Lin Gu, Jun Zhou, Tatsuya Harada, and Edwin R Hancock. Revisiting domain generalized stereo matching networks from a feature consistency perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13001–13011, 2022b.
Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
Zhou et al. [2016] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 286–301. Springer, 2016.
Zhou et al. [2023] Xiaoqing Zhou, Xiang Wang, Jin Zheng, and Xiao Bai. Adaptive spatial sparsification for efficient multi-view stereo matching. Acta Electronica Sinica, 51(11):3079–3091, 2023.
Zhou and Tulsiani [2023] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023.
Zhu et al. [2023] Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang. Fsgs: Real-time few-shot view synthesis using gaussian splatting. arXiv preprint arXiv:2312.00451, 2023.
