Gaussian Splatting with Localized Points Management
Abstract
Point management is a critical component in optimizing 3D Gaussian Splatting (3DGS) models, as the point initialization (e.g., via structure from motion) is distributionally inappropriate. Typically, the Adaptive Density Control (ADC) algorithm is applied, leveraging view-averaged gradient magnitude thresholding for point densification, opacity thresholding for pruning, and regular all-points opacity reset. However, we reveal that this strategy is limited in tackling intricate or special image regions (e.g., transparent ones), as it is unable to identify all the 3D zones that require point densification and lacks an appropriate mechanism to handle ill-conditioned points with negative impacts (e.g., occlusion due to falsely high opacity). To address these limitations, we propose a Localized Point Management (LPM) strategy, capable of identifying those error-contributing zones in the highest demand for both point addition and geometry calibration. Zone identification is achieved by leveraging the underlying multiview geometry constraints, with the guidance of image rendering errors. We apply point densification in the identified zones, whilst resetting the opacity of those points residing in front of these zones so that a new opportunity is created to correct ill-conditioned points. Serving as a versatile plugin, LPM can be seamlessly integrated into existing 3D Gaussian Splatting models. Experimental evaluation across both static 3D and dynamic 4D scenes validates the efficacy of our LPM strategy in boosting a variety of existing 3DGS models both quantitatively and qualitatively. Notably, LPM improves both vanilla 3DGS and SpaceTimeGS to achieve state-of-the-art rendering quality while retaining real-time speeds, outperforming existing methods on challenging datasets such as Tanks & Temples and the Neural 3D Video Dataset.
1 Introduction
Neural rendering has emerged as a general, flexible, and powerful approach for photorealistic novel view synthesis (NVS) from arbitrary camera poses [24], underpinning a wide variety of applications in augmented/virtual/mixed reality [9], robotics [39], and generation [26], among others.
For example, taking a learning-based parametric approach, Neural Radiance Fields (NeRFs) [24] implicitly represent scene radiance of varying complexity using neural networks (e.g., MLPs), without the tedious handcrafting of model designs to account for scene variations in geometry, texture, and illumination. Despite their high-quality representational performance, they are typically computationally inefficient in view rendering due to heavy ray sampling, and thus struggle to scale to high-resolution content and large-scale scene modeling [30, 33].
![Refer to caption](x1.png)
In this context, 3D Gaussian Splatting (3DGS) [13] has emerged as a more efficient alternative with much faster model optimization and real-time neural rendering. As an explicit representation model, this approach begins by initializing a set of 3D Gaussian points with Structure from Motion (SfM), followed by optimizing the parameters of these points via a view reconstruction loss, with the view output produced by differentiable splatting-based rasterization. As the point initialization is coarse and error-prone, a point management mechanism, Adaptive Density Control (ADC), is typically applied during optimization to deal with point distribution issues such as under-population (i.e., not enough points) or over-population (i.e., too many points) in the 3D space. However, we find several limitations with ADC: (1) Simply thresholding the average gradient to decide the regions for point densification tends to overlook under-optimized points. For example, larger Gaussian points, which frequently appear across many views in screen space, typically exhibit lower average gradients. (2) Point sparsity makes it difficult to add sufficient and reliable points to comprehensively cover the scene. (3) Falsely optimized Gaussian points can cause negative effects, e.g., occluding other good points and leading to incorrect depth estimates (see the erroneous placements on windows in Fig. 1).
To overcome the aforementioned limitations, in this paper we propose a novel Localized Point Management (LPM) approach. Our idea is intuitive: identify those 3D Gaussian points that lead to rendering errors. We thus start with an image rendering error map of a specific view. To obtain the error-contributing 3D points, we leverage the region correspondence between different views via feature mapping, subject to the multiview geometry constraint. For each pair of corresponding regions, we cast rays through them from their respective camera views in a cone shape, and consider their intersection as the error source zone. Within each such zone, we consider two situations: (1) In the presence of points, we further apply point densification at a lower threshold to complement the original counterpart locally; (2) In the absence of points due to point sparsity, we add new Gaussian points. Concurrently, we reset the opacity of points that have high estimated opacity and reside in front of these zones, as they could have a high impact on view rendering. This provides an opportunity to correct those potentially ill-conditioned points whilst tuning the newly added ones in the subsequent optimization. To minimize model expansion, we prune the points by opacity in a density-aware manner.
We summarize the contributions as follows: (1) With in-depth analysis, we have discovered that the standard point management mechanism used in 3DGS has several limitations that impede model optimization. (2) We present Localized Point Management (LPM) for these issues by identifying error-contributing 3D zones and implementing appropriate operations for point densification and opacity reset. (3) Extensive experiments validate the benefits of our LPM in improving a diversity of existing 3DGS models in novel view synthesis on both static and dynamic scenes.
2 Related Work
Neural scene representations have always been an important direction in novel view synthesis. Previous methods allocate neural features to structures such as volumes [21, 28], textures [32], and point clouds [1]. The pioneering work of NeRF [24] proposes integrating neural networks with 3D volumetric representations to convert a 3D scene into a learnable density field, enabling high-quality novel view synthesis without requiring explicit modeling of the 3D scene and illumination. Later on, numerous works emerged to boost the quality and efficiency of volume rendering: [3, 37, 5] refine the point sampling strategy in ray marching, while [4, 35] reparameterize the scene to produce a more compact representation. Additionally, regularization terms [8, 42] can be incorporated to constrain the scene representation, resulting in a closer approximation to real-world geometry. Despite their high-quality representational performance, these methods are typically computationally inefficient for view rendering due to the extensive ray sampling required and the use of Multi-Layer Perceptrons (MLPs) to represent the scene, which complicates the computation and optimization of any point within the scene. To address this, several works have proposed novel scene representations aimed at accelerating the rendering process. These representations replace MLPs with sparse voxels [20], hash tables [25], or triplanes [6], significantly enhancing rendering speed. However, real-time rendering remains challenging due to the inherent complexity of the ray marching strategy in volume rendering.
3D Gaussian Splatting represents a recent advancement in novel view synthesis, enabling real-time high-quality rendering. It relies on splatting-based rasterization, computing pixel colors through depth sorting and $\alpha$-blending of projected 2D Gaussians, thereby avoiding the complex sampling strategies of ray marching and achieving real-time performance. Owing to these real-time, high-quality rendering capabilities, 3DGS has been applied to various domains, including autonomous driving, content generation [31], and 4D dynamic scenes [18, 36, 40], among others. Despite these advancements, 3DGS still has some drawbacks, such as the storage cost of Gaussians and the handling of multi-resolution content. Several works have enhanced 3DGS by improving the Gaussian representation, including techniques such as low-pass filtering [41], multiscale Gaussian representations [38], and interpolating Gaussian attributes from structured grid features [22]. However, these works often overlook the importance of point management, specifically Adaptive Density Control, which is typically applied during optimization to address issues like under-population or over-population in the 3D space. Only a few works have focused on point management. For example, GaussianPro [7] directly tackles densification limitations, bridging gaps from SfM-based initialization. Pixel-GS [43] proposes a gradient scaling strategy to suppress artifacts near the camera. Additionally, [27] introduces an auxiliary per-pixel error function to implicitly supervise point contributions.
Although these methods improve densification, they are still unable to identify all the 3D zones that require point densification and lack a proper mechanism to handle ill-conditioned points with negative impacts. In our paper, we propose a novel approach, Localized Point Management, capable of identifying error-contributing zones with the highest demand for both point addition and geometry calibration.
3 Method
3.1 Preliminaries: 3D Gaussian Splatting
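Gaussian Splatting builds upon concepts from EWA splatting [44] and proposes modeling a 3D scene as a collection of 3D Gaussian points $\{G_i\}_{i=1}^{K}$, rendered through volume splatting. Each 3D Gaussian is defined by the equation:
$$G_i(\mathbf{x}) = \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right),$$
where $\boldsymbol{\mu}_i \in \mathbb{R}^{3}$ represents the mean vector, and $\boldsymbol{\Sigma}_i \in \mathbb{R}^{3\times 3}$ denotes its covariance matrix. To maintain the positive semi-definite nature of $\boldsymbol{\Sigma}_i$ during optimization, it is represented as $\boldsymbol{\Sigma}_i = \mathbf{R}_i\mathbf{S}_i\mathbf{S}_i^{\top}\mathbf{R}_i^{\top}$, with the orthogonal rotation matrix $\mathbf{R}_i$ and the diagonal scale matrix $\mathbf{S}_i$.
To render an image from a specific viewpoint, the color of each pixel $\mathbf{p}$ is determined by blending the $N$ ordered Gaussians $\{G_i\}_{i=1}^{N}$ that overlap $\mathbf{p}$, using the formula:
$$\mathbf{c}(\mathbf{p}) = \sum_{i=1}^{N}\mathbf{c}_i\,\alpha_i\prod_{j=1}^{i-1}(1-\alpha_j),$$
where $\alpha_i$ is derived by evaluating a projected 2D Gaussian from $G_i$ at pixel $\mathbf{p}$ combined with a learned opacity for $G_i$, and $\mathbf{c}_i$ is the learnable, view-dependent color modeled using spherical harmonics in 3DGS. Gaussians that influence $\mathbf{p}$ are arranged in ascending order based on their depth from the current viewpoint. Employing differentiable rendering techniques allows for the end-to-end optimization of all Gaussian attributes through training view reconstruction.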
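To make the blending step concrete, the following is a minimal pure-Python sketch of front-to-back alpha blending at a single pixel, in which each Gaussian's color is weighted by its alpha value and by the transmittance left over by the closer Gaussians. It assumes the per-pixel alpha values and colors are already computed and depth-sorted; the function name `composite_pixel` and the early-stop parameter `min_transmittance` are illustrative assumptions rather than part of any 3DGS implementation.

```python
def composite_pixel(colors, alphas, min_transmittance=1e-4):
    """Front-to-back alpha blending of depth-sorted Gaussians at one pixel.

    colors: list of (r, g, b) tuples, nearest Gaussian first.
    alphas: list of per-Gaussian alpha values in [0, 1], same order.
    min_transmittance: hypothetical early-stop threshold (an assumption).
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # product of (1 - alpha_j) over closer Gaussians
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # remaining Gaussians are effectively occluded
    return tuple(pixel)


# A nearly opaque red Gaussian in front hides most of the blue one behind it.
front_opaque = composite_pixel([(1, 0, 0), (0, 0, 1)], [0.9, 0.8])
# Lowering the front opacity lets the blue Gaussian contribute again.
front_reset = composite_pixel([(1, 0, 0), (0, 0, 1)], [0.1, 0.8])
```

The two calls illustrate why a falsely high opacity on a front Gaussian matters: the contribution of everything behind it is scaled by the remaining transmittance.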
Point management
Since existing 3DGS variants start by initializing 3D Gaussian points using Structure from Motion (SfM), the points are often coarse and error-prone. During optimization, a point management mechanism, Adaptive Density Control (ADC), is typically applied to manage point distribution issues.
Specifically, thresholding the average gradient is used to decide on point densification. For each Gaussian point $G_i$, 3DGS tracks the magnitude of the positional gradient $\|\nabla_{\boldsymbol{\mu}_i} L\|$ across all rendered views, which is then averaged into a quantity $\tau_i$. During optimization, if $\tau_i$ surpasses a predefined threshold, the point is considered to inadequately represent the corresponding 3D region. With the scale of the Gaussian as the size measure, a large Gaussian is split into two, while a small one is cloned.
However, this commonly used ADC strategy is unable to identify all the 3D zones with an underlying need for point densification. This is because the local complexity of scene geometry often varies significantly, which is beyond the reach of any single-value thresholding. Besides, there is no proper mechanism to handle ill-conditioned points with negative impacts (e.g., wrong opacity values estimated during training for points scattered across the scene).
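The densification rule and its blind spot can be sketched in a few lines of Python. This is a simplified illustration, not the reference ADC code: the helper names `average_gradient` and `adc_decision` and the numeric thresholds are assumptions chosen for the example.

```python
def average_gradient(grad_norms):
    """Average the per-view positional gradient magnitudes of one Gaussian."""
    return sum(grad_norms) / len(grad_norms) if grad_norms else 0.0


def adc_decision(avg_grad, scale, grad_threshold, scale_threshold):
    """Return 'split', 'clone', or 'keep' for one Gaussian point."""
    if avg_grad <= grad_threshold:
        return "keep"  # treated as adequately representing its region
    # Over the threshold: large Gaussians are split, small ones are cloned.
    return "split" if scale > scale_threshold else "clone"


GRAD_T, SCALE_T = 0.0002, 0.05  # illustrative values only (assumptions)

# A small Gaussian with a strong gradient in every view is cloned.
small = adc_decision(average_gradient([0.0004, 0.0005]), 0.01, GRAD_T, SCALE_T)
# A large Gaussian seen in many views: one strong gradient is averaged away.
large_avg = average_gradient([0.0009] + [0.00005] * 9)
large = adc_decision(large_avg, 0.2, GRAD_T, SCALE_T)
# The same Gaussian is caught once the threshold is lowered.
large_lowered = adc_decision(large_avg, 0.2, GRAD_T / 2, SCALE_T)
```

In this toy setting the large Gaussian is kept under the global threshold and split only under the lowered one, which mirrors the observation that a single global threshold can miss under-optimized points.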
3.2 Localized Gaussian Point Management
To address the aforementioned issues, we introduce a novel model-agnostic point management approach, Localized Point Management (LPM), which leverages multiview geometry constraints to identify error-contributing 3D points, with the guidance of image rendering errors. This approach can be seamlessly integrated with existing 3DGS models without the need for architectural modification.
As illustrated in Figure 2, we begin with an image rendering error map for a specific view. Under the multiview geometry constraint, the corresponding regions in the reference view are matched via feature mapping. For each pair of corresponding regions, we then cast rays through them from their respective camera views in a cone and identify their intersection as the error source zone. Within each zone, we perform localized point manipulation.
Error map generation
To accurately localize those zones in the 3D space that require point densification and geometry calibration, we start by rendering the current view image through the splatting of 3D Gaussians. We then generate an error map (Figure 2(a)) for this specific view against the ground-truth image using an error function [18].
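As a rough illustration of this step, the sketch below computes a per-pixel error map and extracts the high-error pixels. The text does not spell out the error function, so the mean absolute error over color channels used here is a stand-in assumption, as are the function names and the threshold.

```python
def error_map(rendered, target):
    """Per-pixel mean absolute error between two H x W x 3 images (nested lists)."""
    return [
        [
            sum(abs(r - t) for r, t in zip(r_px, t_px)) / len(r_px)
            for r_px, t_px in zip(r_row, t_row)
        ]
        for r_row, t_row in zip(rendered, target)
    ]


def error_pixels(err, threshold):
    """Return (row, col) coordinates whose error exceeds the threshold."""
    return [
        (y, x)
        for y, row in enumerate(err)
        for x, value in enumerate(row)
        if value > threshold
    ]


# Tiny 2 x 2 example with colors in [0, 1].
rendered = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
            [(0.5, 0.5, 0.5), (0.2, 0.2, 0.2)]]
target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
          [(0.5, 0.5, 0.5), (0.5, 0.2, 0.2)]]

err = error_map(rendered, target)
hot = error_pixels(err, threshold=0.5)  # hypothetical threshold
```

Grouping the selected pixels into connected regions would be the natural next step before matching them across views; that part is omitted here.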
![Refer to caption](x2.png)
Error contributing 3D zone identification
To project this rendering error back to the 3D space, we leverage the region correspondence between different views under multiview geometry constraints. This involves the following two key steps.
(i) Cross-view region mapping. We select a neighboring view as the reference image. We follow LightGlue [19], which predicts a partial assignment between two sets of local features extracted from two view images $A$ and $B$. Each local feature $i$ includes a 2D point position $\mathbf{p}_i = (x, y)_i \in [0, 1]^2$, normalized by the image size. The images $A$ and $B$ contain $M$ and $N$ local features, respectively. LightGlue outputs a set of correspondences $\mathcal{M} = \{(i, j)\} \subset \{1, \dots, M\} \times \{1, \dots, N\}$. Since the 2D rendering error regions in the current view may not all appear in the reference image, for an error region $R_A$ in the current view we select the paired region $R_B$ (Figure 2(b)) through the matching points. Additionally, this paired region undergoes multiview adaptive adjustments based on the error map throughout the optimization process.
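One plausible way to select a paired region from matched keypoints is sketched below. It assumes the matcher has already produced index pairs and normalized keypoint positions, models the error region as a circle, and approximates the paired region by the centroid and maximum spread of the matched keypoints in the other view. The function `paired_region` and this circle heuristic are assumptions for illustration, not the procedure used in the paper and not part of the LightGlue API.

```python
import math


def paired_region(kpts_a, kpts_b, matches, center_a, radius_a):
    """Estimate the region in view B paired with a circular error region in view A.

    kpts_a, kpts_b: lists of normalized (x, y) keypoint positions.
    matches: list of (i, j) index pairs, i into kpts_a and j into kpts_b.
    Returns (center_b, radius_b), or None if no match falls in the error region.
    """
    hits = [
        kpts_b[j]
        for i, j in matches
        if math.dist(kpts_a[i], center_a) <= radius_a
    ]
    if not hits:
        return None  # the error region is not visible in view B
    cx = sum(p[0] for p in hits) / len(hits)
    cy = sum(p[1] for p in hits) / len(hits)
    # Enclosing circle around the centroid (not necessarily the smallest one).
    radius_b = max(math.dist((cx, cy), p) for p in hits)
    return (cx, cy), radius_b


kpts_a = [(0.50, 0.50), (0.52, 0.50), (0.90, 0.10)]
kpts_b = [(0.40, 0.50), (0.44, 0.50), (0.80, 0.20)]
matches = [(0, 0), (1, 1), (2, 2)]

center_b, radius_b = paired_region(kpts_a, kpts_b, matches, (0.50, 0.50), 0.05)
unmatched = paired_region(kpts_a, kpts_b, matches, (0.10, 0.90), 0.05)
```

Returning `None` when no correspondence lands in the error region reflects the case where the region does not appear in the reference image.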
(ii) 2D-to-3D projection. After obtaining the paired regions with rendering errors, we project each 2D error region to the 3D space via multiview geometry constraints. Specifically, for the error region $R_A$ in the current view, we cast rays in a cone shape from the camera's center of projection $\mathbf{o}_A$ along the direction $\mathbf{d}_A$, which aligns with the pixel's center (Figure 2(c)). The apex of this cone is located at $\mathbf{o}_A$, and its radius at the image plane is $r_A$. Hence, the cone $\mathcal{C}_A$ is parameterized as $(\mathbf{o}_A, \mathbf{d}_A, r_A)$. The radius $r_A$ is set to match the radius of the smallest circumscribed circle of the 2D error region, creating a cone in 3D space that can trace the Gaussian points contributing to the 2D error region. Concurrently, a corresponding cone, denoted as $\mathcal{C}_B$, belonging to the paired region $R_B$ in the reference view is similarly projected. Subsequently, we compute the intersection points of these rays. To regionalize these points, we use the smallest sphere that contains them as the error source 3D zone $\mathcal{Z}$.
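The geometric core of this step can be sketched with standard ray geometry. The example below pairs rays from two cameras, takes the midpoint of their closest approach as an approximate intersection point, and wraps the resulting points in a sphere. Two simplifications are assumptions of this sketch: rays are paired one-to-one rather than sampled densely from each cone, and the sphere is centered at the centroid, so it encloses the points but is not guaranteed to be the smallest enclosing sphere.

```python
import math


def closest_point_between_rays(o1, d1, o2, d2, eps=1e-12):
    """Midpoint of the shortest segment between two rays, or None if undefined."""
    w = [a - b for a, b in zip(o1, o2)]
    a = sum(x * x for x in d1)
    b = sum(x * y for x, y in zip(d1, d2))
    c = sum(x * x for x in d2)
    d = sum(x * y for x, y in zip(d1, w))
    e = sum(x * y for x, y in zip(d2, w))
    denom = a * c - b * b
    if abs(denom) < eps:
        return None  # parallel rays never intersect
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    if t1 < 0 or t2 < 0:
        return None  # closest approach lies behind one of the cameras
    p1 = [o + t1 * x for o, x in zip(o1, d1)]
    p2 = [o + t2 * x for o, x in zip(o2, d2)]
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


def bounding_sphere(points):
    """Centroid-centered sphere enclosing all points (an approximation)."""
    n = len(points)
    center = tuple(sum(p[k] for p in points) / n for k in range(3))
    radius = max(math.dist(center, p) for p in points)
    return center, radius


cam_a, cam_b = (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)
rays_a = [(1.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
rays_b = [(-1.0, 0.0, 1.0), (-1.0, 0.0, 2.0)]

hits = [closest_point_between_rays(cam_a, da, cam_b, db)
        for da, db in zip(rays_a, rays_b)]
zone_center, zone_radius = bounding_sphere([h for h in hits if h is not None])
```

With real, noisy camera poses the rays rarely meet exactly, which is why the midpoint of the closest approach is used instead of an exact intersection.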
Points manipulation
Recall that in existing 3DGS, point management relies only on the view-averaged gradient magnitude to determine point densification globally. In addition to this, we further perform localized point addition and geometry calibration within the identified error source 3D zone $\mathcal{Z}$. For point addition, we consider two common situations: (1) In the presence of points, we apply point densification to locally complement the original counterparts. We set a lower threshold to select the points that need densification, aiming to enhance the geometric details. The densification rule is consistent with 3DGS, but it focuses on the local 3D zones that need it most. Specifically, for small Gaussians, our strategy involves cloning the Gaussians while maintaining their size and repositioning them along the positional gradient to better capture emerging geometrical features. Conversely, larger Gaussians situated in areas of high variance are split into smaller points to more accurately represent the underlying geometry. (2) In cases of point sparsity, we add new Gaussian points at the center of the 3D zone.
In the context of $\alpha$-blending in 3DGS, if the points at the forefront of the identified 3D zone have the highest opacity, they may occlude valid points, leading to incorrect depth estimation, as shown in Figure 1. To deal with such issues, we treat these points as potentially ill-conditioned. We reset the opacity of these points to provide an opportunity for correction, further calibrating the geometry.
To minimize model expansion, we adaptively prune points based on their opacity values, starting from low to high. The number of points pruned is determined by the density of points in the zone. This strategic reduction ensures that our point management remains cost efficient and adaptive to the evolving needs of the scene representation.
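The localized operations can be summarized in a small sketch that works on plain dictionaries. Everything below is a simplified assumption for illustration: the point layout (`xyz`, `opacity`, `avg_grad`), the test for "in front of the zone" (inside the cone from the camera to the zone and nearer than the zone), and the default values `high_opacity`, `reset_opacity`, and `init_opacity` are hypothetical. Density-aware pruning is left out.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def manage_zone(points, zone_center, zone_radius, cam_origin,
                local_grad_threshold, high_opacity=0.5,
                reset_opacity=0.01, init_opacity=0.1):
    """Return (indices to densify, indices reset, new points) for one zone."""
    axis = _sub(zone_center, cam_origin)
    zone_depth = math.sqrt(_dot(axis, axis))
    axis = tuple(x / zone_depth for x in axis)
    to_densify, to_reset, in_zone = [], [], 0
    for idx, p in enumerate(points):
        if math.dist(p["xyz"], zone_center) <= zone_radius:
            in_zone += 1
            # Inside the zone: densify using the lower, local threshold.
            if p["avg_grad"] > local_grad_threshold:
                to_densify.append(idx)
            continue
        rel = _sub(p["xyz"], cam_origin)
        depth = _dot(rel, axis)
        if 0.0 < depth < zone_depth - zone_radius:
            # Perpendicular distance to the camera-to-zone axis.
            perp = math.sqrt(max(_dot(rel, rel) - depth * depth, 0.0))
            cone_radius = zone_radius * depth / zone_depth
            if perp <= cone_radius and p["opacity"] > high_opacity:
                to_reset.append(idx)
    for idx in to_reset:
        points[idx]["opacity"] = reset_opacity  # give occluders a fresh start
    new_points = []
    if in_zone == 0:
        # Empty zone: seed one new Gaussian at the zone center.
        new_points.append({"xyz": tuple(zone_center),
                           "opacity": init_opacity, "avg_grad": 0.0})
    return to_densify, to_reset, new_points


pts = [
    {"xyz": (0.0, 0.0, 4.2), "opacity": 0.6, "avg_grad": 0.00015},  # in zone
    {"xyz": (0.1, 0.0, 2.0), "opacity": 0.9, "avg_grad": 0.0},      # occluder
    {"xyz": (3.0, 0.0, 2.0), "opacity": 0.9, "avg_grad": 0.0},      # off-axis
]
densify, reset, added = manage_zone(pts, (0.0, 0.0, 4.0), 1.0,
                                    (0.0, 0.0, 0.0), 0.0001)
```

In the example, the in-zone point is flagged for densification, the high-opacity point between the camera and the zone has its opacity reset, and the off-axis point is left untouched.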
4 Experiment
Datasets and metrics
We conducted an extensive evaluation using both static and dynamic scenes derived from publicly available datasets. For static scenes, our approach was applied to a total of 13 scenes as specified in the 3DGS framework [13], which includes nine scenes from Mip-NeRF360 [3], two from Tanks&Temples [14], and two from DeepBlending [12]. For dynamic scenes, our approach was tested across six scenes from the Neural 3D Video Dataset [16].
To evaluate novel view synthesis performance, we follow standard protocols by selecting one out of every eight images as test images, with the remaining used for training in static scenes. For each dynamic scene within the Neural 3D Video Dataset, one view was designated for testing while the others were allocated for training purposes. Evaluation metrics included the peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and the learned perceptual image patch similarity (LPIPS), which are broadly recognized standards in the field.
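For reference, the evaluation split and the PSNR metric can be sketched as follows. The choice of which image in each group of eight is held out (here every index divisible by 8) is an assumption, as are the function names; PSNR is computed with its standard definition for intensities in [0, 1].

```python
import math


def split_every_eighth(image_names):
    """Hold out one image in every eight for testing, train on the rest."""
    test = [name for i, name in enumerate(image_names) if i % 8 == 0]
    train = [name for i, name in enumerate(image_names) if i % 8 != 0]
    return train, test


def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat lists of intensities."""
    mse = sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


names = [f"img_{i:03d}" for i in range(20)]
train_set, test_set = split_every_eighth(names)

score = psnr([0.5, 0.5, 0.5, 0.5], [0.6, 0.4, 0.5, 0.5])
```

SSIM and LPIPS are omitted because they depend on windowed statistics and a pretrained network, respectively.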
Baseline and Implementation
Vanilla 3D Gaussian Splatting (3DGS) [13] and its variant SpacetimeGS (STGS) [18] are selected as our main baselines for their established SOTA performance in novel view synthesis. For the static 3D benchmark, we also report the results of Mip-NeRF360 [3], iNGP [25], and Plenoxels [11] as in [13]. For the dynamic 4D benchmark, we also compare against systems such as DyNeRF [17], StreamRF [15], and K-planes [10]. In alignment with the approach described in 3DGS and STGS, our models are trained for 30k iterations across all scenes, following the same training schedule and hyperparameters. In addition to the original Gaussian densification strategies used in 3DGS and STGS, we also perform localized point management, including addition, reset, and pruning. We maintain the same thresholds for splitting and cloning points as in the original 3DGS and STGS. For point matching, we perform offline extraction to save computational cost. All experiments were conducted on an RTX 3090 GPU with 24GB of memory.
![Refer to caption](x3.png)
4.1 Main Results
Results on static 3D datasets
Method | Mip-NeRF 360 PSNR | Mip-NeRF 360 SSIM | Mip-NeRF 360 LPIPS | Tanks&Temples PSNR | Tanks&Temples SSIM | Tanks&Temples LPIPS | Deep Blending PSNR | Deep Blending SSIM | Deep Blending LPIPS |
---|---|---|---|---|---|---|---|---|---|
Plenoxels [11] | 23.08 | 0.625 | 0.463 | 21.08 | 0.719 | 0.379 | 23.06 | 0.795 | 0.510 |
INGP-Big [25] | 25.59 | 0.699 | 0.331 | 21.92 | 0.745 | 0.305 | 24.96 | 0.817 | 0.390 |
Mip-NeRF 360 [3] | 27.69 | 0.792 | 0.237 | 22.22 | 0.759 | 0.257 | 29.40 | 0.901 | 0.245 |
3DGS [13] | 27.21 | 0.815 | 0.214 | 23.14 | 0.841 | 0.183 | 29.41 | 0.903 | 0.243 |
3DGS* | 27.47 | 0.816 | 0.216 | 23.67 | 0.849 | 0.177 | 29.55 | 0.904 | 0.245 |
3DGS + LPM | 27.59 | 0.820 | 0.216 | 23.83 | 0.850 | 0.181 | 29.76 | 0.908 | 0.241 |
The quantitative results (PSNR, SSIM, and LPIPS) on the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets are presented in Table 1. We retrained the 3DGS model (referred to as 3DGS*), as it yields better performance than the originally reported 3DGS results. Our approach achieves results comparable to the state-of-the-art on the Mip-NeRF360 dataset and further enhances 3DGS using our point management technique. Additionally, LPM improves 3DGS to set new state-of-the-art PSNR and SSIM results on the Tanks & Temples and DeepBlending datasets, effectively capturing more challenging environments (e.g., light effects, transparency). These results quantitatively validate the effectiveness of our method in improving the quality of reconstruction.
In Figure 3, we present a comparison between 3DGS [13] and 3DGS+LPM. A variety of improvements can be observed, particularly in challenging cases such as light effects, completeness at a distance, intricate details, and transparency. Our LPM significantly reduces artifacts in specific regions on top of 3DGS, particularly in the tree and flowers in the second and third rows. These regions require more points for accurate population, leading to a more precise and detailed reconstruction. Additionally, the tablecloth and window regions in the first and last rows are affected by ill-conditioned points. Our geometry calibration with LPM provides an opportunity to correct these potentially ill-conditioned points, enhancing the overall reconstruction accuracy.
Results on dynamic 4D datasets
![Refer to caption](x4.png)
Method | PSNR | DSSIM₁ | DSSIM₂ | LPIPS | FPS |
---|---|---|---|---|---|
LLFF [23]¹ | 23.24 | - | 0.076 | 0.235 | - |
DyNeRF [17]¹ | 29.58 | 0.020 | 0.083 | 0.063 | 0.015 |
Dynamic-4DGS [36]¹ | - | - | - | - | 30 |
4DGS [40]¹ | 29.38 | - | - | - | 114 |
STGS [18]¹ | 29.58 | 0.038 | 0.022 | 0.063 | 103 |
STGS*¹ | 29.48 | 0.038 | 0.023 | 0.066 | 110 |
STGS¹ + LPM | 29.84 | 0.036 | 0.022 | 0.062 | 105 |
StreamRF [15] | 28.26 | - | - | 0.039 | 10.9 |
NeRFPlayer [29] | 30.69 | 0.034 | - | 0.111 | 0.05 |
HyperReel [2] | 31.10 | 0.036 | - | 0.096 | 2 |
K-planes [10] | 31.63 | 0.018 | - | 0.31 | 3 |
MixVoxels-X [34] | 31.73 | 0.015 | - | 0.064 | 4.6 |
Dynamic-4DGS [36] | 31.15 | - | 0.016 | 0.049 | 30 |
4DGS [40] | 32.01 | - | - | 0.055 | 114 |
STGS [18] | 32.05 | 0.026 | 0.014 | 0.044 | 140 |
STGS* | 31.99 | 0.026 | 0.015 | 0.045 | 145 |
STGS + LPM | 32.40 | 0.025 | 0.014 | 0.045 | 140 |
Table 2 presents a quantitative evaluation on the Neural 3D Video Dataset. Following established practices, training and evaluation are conducted at half resolution, with the first camera held out for evaluation [17]. Integrating our LPM into SpaceTimeGS yields the highest PSNR across all comparisons. Notably, our method demonstrates clear improvements in the challenging Flame Salmon scene compared to STGS [18]. Our approach not only improves on previous methods in rendering quality but also maintains comparable rendering speed.
In addition to the quantitative assessment, we provide qualitative comparisons on the Flame Salmon and Flame Steak scenes, as illustrated in Figure 4. The quality of synthesis in both static and dynamic regions markedly outperforms STGS. Several intricate details, including the tree behind the window and the fine features like the dog’s tongue, are faithfully reproduced with higher accuracy compared to STGS [18]. Both examples indicate that LPM improves upon STGS for superior scene modeling.
4.2 Ablation study
![Refer to caption](x5.png)
We conducted ablation studies on two of the more challenging scenes: PlayRoom from Deep Blending [12] and Truck from Tanks&Temples [14].
Effectiveness and efficiency of the LPM
We hypothesize that the Adaptive Density Control (ADC) tends to overlook under-optimized points due to its simplistic approach of thresholding the average gradient. The most direct way to capture all such points is to lower the threshold used in the densification process. Although this solution can reduce blurring in specific regions, such as the toy (red box) illustrated in Figure 5, it still has limitations. As shown in Table 3, lowering the threshold for 3DGS more than doubles the number of Gaussian points and substantially lengthens training. Additionally, the PSNR drops on PlayRoom and barely changes on Truck, which we attribute to the introduction of unnecessary points in already dense areas. In contrast, LPM generates points in the areas indicated by the error map, leading to more accurate and detailed reconstructions with a point count and training time close to those of 3DGS. As demonstrated by the qualitative comparison in Figure 5, 3DGS with LPM achieves superior qualitative results.
Scene | Method | PSNR | LPIPS | Gaussians | Training time |
---|---|---|---|---|---|
PlayRoom | 3DGS* [13] | 30.03 | 0.244 | 232k | 22min |
PlayRoom | 3DGS* (lower threshold) | 29.69 | 0.240 | 523k | 36min |
PlayRoom | 3DGS + LPM | 30.22 | 0.241 | 186k | 23min |
Truck | 3DGS* [13] | 25.42 | 0.146 | 257k | 19min |
Truck | 3DGS* (lower threshold) | 25.45 | 0.127 | 635k | 35min |
Truck | 3DGS + LPM | 25.61 | 0.154 | 265k | 21min |
Individual points manipulation
We study the effect of the individual point manipulations of LPM, namely point addition and the reset of ill-conditioned points. The results in Table 4 show that (1) each manipulation yields a positive PSNR gain, suggesting that both components of LPM are meaningful. (2) The point addition operation densifies the under-optimized points that may be overlooked by 3DGS, and thereby captures more geometric detail (e.g., the details of the toy and the leaves of the tree, see Fig. 5). (3) Resetting points in certain zones provides the opportunity to correct ill-conditioned points and achieve geometry calibration (e.g., the window of the truck, see Fig. 5).
Method | PlayRoom PSNR | PlayRoom LPIPS | PlayRoom SSIM | Truck PSNR | Truck LPIPS | Truck SSIM |
---|---|---|---|---|---|---|
Full LPM | 30.22 | 0.241 | 0.910 | 25.61 | 0.154 | 0.883 |
w/o point addition | 30.10 | 0.241 | 0.910 | 25.43 | 0.153 | 0.883 |
w/o reset | 30.07 | 0.243 | 0.908 | 25.52 | 0.144 | 0.883 |
Robustness to sparse training images
We conducted further ablation studies to verify the impact of the number of training images. In Table 5, we present the results of training 3DGS and our method using randomly selected subsets comprising 25%, 50%, 75%, and 100% of the training images. Our method achieves a higher PSNR than 3DGS in all settings except PlayRoom at 75%, while LPIPS is mixed on Truck.
Scene | Method | 25% PSNR | 25% LPIPS | 50% PSNR | 50% LPIPS | 75% PSNR | 75% LPIPS | 100% PSNR | 100% LPIPS |
---|---|---|---|---|---|---|---|---|---|
PlayRoom | 3DGS [13] | 25.33 | 0.313 | 27.37 | 0.270 | 29.16 | 0.253 | 30.03 | 0.244 |
PlayRoom | 3DGS + LPM | 25.43 | 0.313 | 27.42 | 0.267 | 29.06 | 0.252 | 30.22 | 0.241 |
Truck | 3DGS [13] | 22.46 | 0.177 | 24.15 | 0.154 | 24.86 | 0.150 | 25.42 | 0.146 |
Truck | 3DGS + LPM | 22.95 | 0.173 | 24.55 | 0.157 | 25.14 | 0.152 | 25.61 | 0.154 |
5 Conclusions and limitations
We propose Localized Point Management (LPM), a novel point management approach to address the limitations of the Adaptive Density Control (ADC) mechanism in 3D Gaussian Splatting (3DGS). The core idea of LPM is identifying the error-contributing 3D zones that require both point addition and geometry calibration under multiview geometry constraints, guided by image rendering errors. We implement appropriate operations for point densification and opacity reset. As a versatile plugin, LPM can be seamlessly integrated into existing 3DGS-based rendering methods. Extensive experiments across both static 3D and dynamic 4D scenes validate the efficacy of LPM in enhancing existing ADC mechanisms both quantitatively and qualitatively. While our method identifies the 3D Gaussian points that lead to rendering errors, it still follows the densification rules of 3DGS [13]. This approach may not be optimal for under-optimized points, and we leave this aspect for further investigation.
References
- [1] K.-A. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, and V. Lempitsky. Neural point-based graphics. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, pages 696–712. Springer, 2020.
- [2] B. Attal, J.-B. Huang, C. Richardt, M. Zollhöfer, J. Kopf, M. O’Toole, and C. Kim. Hyperreel: High-fidelity 6-dof video with ray-conditioned sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16610–16620, June 2023.
- [3] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
- [4] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
- [5] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19697–19705, 2023.
- [6] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022.
- [7] K. Cheng, X. Long, K. Yang, Y. Yao, W. Yin, Y. Ma, W. Wang, and X. Chen. Gaussianpro: 3d gaussian splatting with progressive propagation. arXiv preprint arXiv:2402.14650, 2024.
- [8] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
- [9] N. Deng, Z. He, J. Ye, B. Duinkharjav, P. Chakravarthula, X. Yang, and Q. Sun. Fov-nerf: Foveated neural radiance fields for virtual reality. IEEE Transactions on Visualization and Computer Graphics, 28(11):3854–3864, 2022.
- [10] S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12479–12488, 2023.
- [11] S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
- [12] P. Hedman, J. Philip, T. Price, J.-M. Frahm, G. Drettakis, and G. Brostow. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (ToG), 37(6):1–15, 2018.
- [13] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
- [14] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
- [15] L. Li, Z. Shen, Z. Wang, L. Shen, and P. Tan. Streaming radiance fields for 3d video synthesis. In Advances in Neural Information Processing Systems, 2022.
- [16] T. Li, M. Slavcheva, M. Zollhoefer, S. Green, C. Lassner, C. Kim, T. Schmidt, S. Lovegrove, M. Goesele, R. Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5521–5531, 2022.
- [17] T. Li, M. Slavcheva, M. Zollhoefer, S. Green, C. Lassner, C. Kim, T. Schmidt, S. Lovegrove, M. Goesele, R. Newcombe, and Z. Lv. Neural 3d video synthesis from multi-view video. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5511–5521, 2022.
- [18] Z. Li, Z. Chen, Z. Li, and Y. Xu. Spacetime gaussian feature splatting for real-time dynamic view synthesis. arXiv preprint arXiv:2312.16812, 2023.
- [19] P. Lindenberger, P.-E. Sarlin, and M. Pollefeys. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17627–17638, 2023.
- [20] L. Liu, J. Gu, K. Zaw Lin, T.-S. Chua, and C. Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
- [21] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh. Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751, 2019.
- [22] T. Lu, M. Yu, L. Xu, Y. Xiangli, L. Wang, D. Lin, and B. Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. arXiv preprint arXiv:2312.00109, 2023.
- [23] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4), 2019.
- [24] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- [25] T. Müller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022.
- [26] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
- [27] S. Rota Bulò, L. Porzi, and P. Kontschieder. Revising densification in gaussian splatting. arXiv e-prints, pages arXiv–2404, 2024.
- [28] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhofer. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019.
- [29] L. Song, A. Chen, Z. Li, Z. Chen, L. Chen, J. Yuan, Y. Xu, and A. Geiger. Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 29(5):2732–2742, 2023.
- [30] M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall, P. P. Srinivasan, J. T. Barron, and H. Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8248–8258, 2022.
- [31] J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.
- [32] J. Thies, M. Zollhöfer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019.
- [33] H. Turki, D. Ramanan, and M. Satyanarayanan. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12922–12931, 2022.
- [34] F. Wang, S. Tan, X. Li, Z. Tian, Y. Song, and H. Liu. Mixed neural voxels for fast multi-view video synthesis. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 19649–19659, 2023.
- [35] P. Wang, Y. Liu, Z. Chen, L. Liu, Z. Liu, T. Komura, C. Theobalt, and W. Wang. F2-nerf: Fast neural radiance field training with free camera trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4150–4159, 2023.
- [36] G. Wu, T. Yi, J. Fang, L. Xie, X. Zhang, W. Wei, W. Liu, Q. Tian, and X. Wang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
- [37] Q. Xu, Z. Xu, J. Philip, S. Bi, Z. Shu, K. Sunkavalli, and U. Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022.
- [38] Z. Yan, W. F. Low, Y. Chen, and G. H. Lee. Multi-scale 3d gaussian splatting for anti-aliased rendering. arXiv preprint arXiv:2311.17089, 2023.
- [39] Z. Yang, Y. Chen, J. Wang, S. Manivasagam, W.-C. Ma, A. J. Yang, and R. Urtasun. Unisim: A neural closed-loop sensor simulator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1389–1399, 2023.
- [40] Z. Yang, H. Yang, Z. Pan, X. Zhu, and L. Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. In ICLR, 2024.
- [41] Z. Yu, A. Chen, B. Huang, T. Sattler, and A. Geiger. Mip-splatting: Alias-free 3d gaussian splatting. arXiv preprint arXiv:2311.16493, 2023.
- [42] Z. Yu, S. Peng, M. Niemeyer, T. Sattler, and A. Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in neural information processing systems, 35:25018–25032, 2022.
- [43] Z. Zhang, W. Hu, Y. Lao, T. He, and H. Zhao. Pixel-gs: Density control with pixel-aware gradient for 3d gaussian splatting. arXiv preprint arXiv:2403.15530, 2024.
- [44] M. Zwicker, H. Pfister, J. Van Baar, and M. Gross. Ewa volume splatting. In Proceedings Visualization, 2001. VIS’01., pages 29–538. IEEE, 2001.
Appendix A Appendix
A.1 Additional Results
Per-scene Result of Static 3D
We provide additional per-scene quantitative results for all three datasets. Tables 6, 7, 8, 9, 10, and 11 present the metrics for each scene in the Mip-NeRF360 [3], Tanks&Temples [14], and DeepBlending [12] datasets. Our method improves 3DGS [13] scene modeling in the vast majority of scenes.
Method (PSNR) | Bicycle | Flowers | Garden | Stump | Treehill | Room |
---|---|---|---|---|---|---|
Plenoxels [11] | 21.912 | 20.097 | 23.4947 | 20.661 | 22.487 | 27.594 |
INGP-Big [25] | 22.171 | 20.652 | 25.069 | 23.466 | 22.373 | 29.690 |
Mip-NeRF 360 [3] | 24.37 | 21.73 | 26.98 | 26.40 | 22.87 | 31.63 |
3DGS [13] | 25.246 | 21.520 | 27.410 | 26.550 | 22.490 | 30.632 |
3DGS* | 25.166 | 21.576 | 27.388 | 26.637 | 22.487 | 31.53 |
3DGS + LPM | 25.4 | 21.73 | 27.43 | 26.81 | 22.78 | 31.58 |
Method (PSNR) | Counter | Kitchen | Bonsai | Dr Johnson | Playroom | Truck | Train |
---|---|---|---|---|---|---|---|
Plenoxels [11] | 23.624 | 23.420 | 24.669 | 23.142 | 22.980 | 23.221 | 18.927 |
INGP-Big [25] | 26.691 | 29.479 | 30.685 | 28.257 | 21.665 | 23.383 | 20.456 |
Mip-NeRF 360 [3] | 29.55 | 32.23 | 33.46 | 29.140 | 29.657 | 24.912 | 19.523 |
3DGS [13] | 28.700 | 30.317 | 31.980 | 28.766 | 30.044 | 25.187 | 21.097 |
3DGS* | 28.90 | 31.43 | 32.14 | 29.08 | 30.03 | 25.42 | 21.91 |
3DGS + LPM | 28.91 | 31.45 | 32.20 | 29.30 | 30.22 | 25.61 | 22.05 |
Method (LPIPS) | Bicycle | Flowers | Garden | Stump | Treehill | Room |
---|---|---|---|---|---|---|
Plenoxels [11] | 0.506 | 0.521 | 0.3864 | 0.503 | 0.540 | 0.4186 |
INGP-Big [25] | 0.446 | 0.441 | 0.257 | 0.421 | 0.450 | 0.261 |
Mip-NeRF 360 [3] | 0.301 | 0.344 | 0.170 | 0.261 | 0.339 | 0.211 |
3DGS [13] | 0.205 | 0.336 | 0.103 | 0.210 | 0.317 | 0.220 |
3DGS* | 0.211 | 0.336 | 0.107 | 0.215 | 0.324 | 0.218 |
3DGS + LPM | 0.203 | 0.337 | 0.108 | 0.224 | 0.347 | 0.209 |
Method (LPIPS) | Counter | Kitchen | Bonsai | Dr Johnson | Playroom | Truck | Train |
---|---|---|---|---|---|---|---|
Plenoxels [11] | 0.441 | 0.447 | 0.398 | 0.521 | 0.499 | 0.335 | 0.422 |
INGP-Big [25] | 0.306 | 0.195 | 0.205 | 0.352 | 0.428 | 0.249 | 0.360 |
Mip-NeRF 360 [3] | 0.204 | 0.127 | 0.176 | 0.237 | 0.252 | 0.159 | 0.354 |
3DGS [13] | 0.204 | 0.129 | 0.205 | 0.244 | 0.241 | 0.148 | 0.218 |
3DGS* | 0.200 | 0.126 | 0.204 | 0.245 | 0.244 | 0.146 | 0.207 |
3DGS + LPM | 0.200 | 0.125 | 0.202 | 0.241 | 0.241 | 0.154 | 0.209 |
Method (SSIM) | Bicycle | Flowers | Garden | Stump | Treehill | Room |
---|---|---|---|---|---|---|
Plenoxels [11] | 0.496 | 0.431 | 0.6063 | 0.523 | 0.509 | 0.8417 |
INGP-Big [25] | 0.512 | 0.486 | 0.701 | 0.594 | 0.542 | 0.871 |
Mip-NeRF 360 [3] | 0.685 | 0.583 | 0.813 | 0.744 | 0.632 | 0.913 |
3DGS [13] | 0.771 | 0.605 | 0.868 | 0.775 | 0.638 | 0.914 |
3DGS* | 0.765 | 0.606 | 0.867 | 0.773 | 0.634 | 0.920 |
3DGS + LPM | 0.776 | 0.609 | 0.870 | 0.781 | 0.636 | 0.923 |
Method (SSIM) | Counter | Kitchen | Bonsai | Dr Johnson | Playroom | Truck | Train |
---|---|---|---|---|---|---|---|
Plenoxels [11] | 0.759 | 0.648 | 0.814 | 0.787 | 0.802 | 0.774 | 0.663 |
INGP-Big [25] | 0.817 | 0.858 | 0.906 | 0.854 | 0.779 | 0.800 | 0.689 |
Mip-NeRF 360 [3] | 0.894 | 0.920 | 0.941 | 0.901 | 0.900 | 0.857 | 0.660 |
3DGS [13] | 0.905 | 0.922 | 0.938 | 0.899 | 0.906 | 0.879 | 0.802 |
3DGS* | 0.908 | 0.927 | 0.942 | 0.901 | 0.907 | 0.882 | 0.815 |
3DGS + LPM | 0.909 | 0.929 | 0.943 | 0.905 | 0.910 | 0.883 | 0.817 |
Per-scene Result of Dynamic 4D
In Table 12, we provide the PSNR on different scenes. The quantitative results clearly show that LPM improves STGS [18], helping it faithfully capture subtle static and dynamic information.
Method (PSNR) | Coffee Martini | Spinach | Cut Beef | Flame Salmon | Flame Steak | Sear Steak |
---|---|---|---|---|---|---|
K-Planes-explicit [10] | 28.74 | 32.19 | 31.93 | 28.71 | 31.80 | 31.89 |
K-Planes-hybrid [10] | 29.99 | 32.60 | 31.82 | 30.44 | 32.38 | 32.52 |
MixVoxels [34] | 29.36 | 31.61 | 31.30 | 29.92 | 31.21 | 31.43 |
NeRFPlayer [29] | 31.53 | 30.56 | 29.35 | 31.65 | 31.93 | 29.12 |
HyperReel [2] | 28.37 | 32.30 | 32.92 | 28.26 | 32.20 | 32.57 |
Dynamic-4D [36] | 27.34 | 32.46 | 32.90 | 29.20 | 32.51 | 32.49 |
4DGS [40] | 28.33 | 32.93 | 33.85 | 29.38 | 34.03 | 33.51 |
STGS [18] | 28.61 | 33.18 | 33.52 | 29.48 | 33.64 | 33.89 |
STGS* | 28.48 | 33.05 | 33.40 | 29.48 | 33.74 | 33.80 |
STGS+LPM | 28.93 | 33.27 | 33.90 | 29.84 | 34.26 | 34.20 |
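
To make the per-scene improvement easier to read off, the following minimal sketch computes the PSNR difference between the STGS* and STGS+LPM rows of the table above. The numbers are copied from those two rows; the function name `psnr_gains` and the scene labels (abbreviated from the table header) are our own illustrative choices and are not part of any released code.

```python
# PSNR values (dB) copied from the STGS* and STGS+LPM rows of the table above.
# Scene labels are abbreviated from the table header (illustrative only).
scenes = ["Coffee Martini", "Spinach", "Cut Beef",
          "Flame Salmon", "Flame Steak", "Sear Steak"]
stgs_star = [28.48, 33.05, 33.40, 29.48, 33.74, 33.80]
stgs_lpm = [28.93, 33.27, 33.90, 29.84, 34.26, 34.20]


def psnr_gains(baseline, improved):
    """Return per-scene PSNR differences (improved minus baseline), in dB."""
    # Round to two decimals, matching the precision reported in the table.
    return [round(b - a, 2) for a, b in zip(baseline, improved)]


gains = psnr_gains(stgs_star, stgs_lpm)
mean_gain = sum(gains) / len(gains)

for scene, gain in zip(scenes, gains):
    print(f"{scene:>14}: {gain:+.2f} dB")
print(f"{'Mean':>14}: {mean_gain:+.2f} dB")
```

Under these reported values, every scene shows a positive difference, which is consistent with the statement that LPM improves STGS on this dataset.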
Appendix B More visualizations
Figure 6 provides more examples on the static 3D and dynamic 4D datasets.
![Refer to caption](x6.png)