Figure 1. 3D Gaussian Splatting (Kerbl et al., 2023) suffers from popping artifacts during view rotation due to its approximate, global sorting scheme. Our method is able to effectively circumvent short-range popping artifacts (left) and long-range view-inconsistencies (right) during rotation with a novel, hierarchical per-pixel sorting strategy.

StopThePop: Sorted Gaussian Splatting for View-Consistent Real-time Rendering

Lukas Radl and Michael Steiner (Graz University of Technology, Austria); Mathias Parger (Huawei Technologies, Austria); Alexander Weinrauch (Graz University of Technology, Austria); Bernhard Kerbl (TU Wien, Austria); and Markus Steinberger (Graz University of Technology, Austria, and Huawei Technologies, Austria)
Abstract.

Gaussian Splatting has emerged as a prominent model for constructing 3D representations from images across diverse domains. However, the efficiency of the 3D Gaussian Splatting rendering pipeline relies on several simplifications. Notably, reducing Gaussians to 2D splats with a single view-space depth introduces popping and blending artifacts during view rotation. Addressing this issue requires accurate per-pixel depth computation, yet a full per-pixel sort proves excessively costly compared to a global sort operation. In this paper, we present a novel hierarchical rasterization approach that systematically resorts and culls splats with minimal processing overhead. Our software rasterizer effectively eliminates popping artifacts and view inconsistencies, as demonstrated through both quantitative and qualitative measurements. Simultaneously, our method mitigates the potential for cheating view-dependent effects with popping, ensuring a more authentic representation. Despite the elimination of cheating, our approach achieves comparable quantitative results for test images, while increasing the consistency for novel view synthesis in motion. Due to its design, our hierarchical approach is only 4% slower on average than the original Gaussian Splatting. Notably, enforcing consistency enables a reduction in the number of Gaussians by approximately half with nearly identical quality and view-consistency. Consequently, rendering performance is nearly doubled, making our approach 1.6x faster than the original Gaussian Splatting, with a 50% reduction in memory requirements. Our renderer is publicly available at https://github.com/r4dl/StopThePop.

Parallel Computing, Point-based Rendering, Real-Time Rendering
copyright: rightsretained; journal: TOG; journalyear: 2024; journalvolume: 43; journalnumber: 4; article: 64; publicationmonth: 7; doi: 10.1145/3658187; ccs: Computing methodologies Rasterization

1. Introduction

In recent years, Neural Radiance Fields (NeRFs) (Mildenhall et al., 2020) have triggered a new surge of research around differentiable rendering of 3D representations. Leveraging the traditional volume rendering equation, NeRFs are fully differentiable, enabling continuous optimization to align the representation to diverse input views and support high-quality novel view synthesis. This differentiability also proves valuable in addressing other rendering challenges that necessitate gradient flow and optimization.

Figure 2. Effect of collapsing 3D Gaussians into 2D splats and 3DGS’s depth simplification: (a) Integrating Gaussians along view rays $\mathbf{r}$ requires careful consideration of potentially overlapping 1D Gaussians. (b) Using flattened 2D splats and view-space $z$ as depth (projection of $\mathbf{\mu}$ onto $\mathbf{v}$) puts 2D splats on spherical segments around the camera, inverting the relative positions of the two Gaussians along the example view ray. (c) Camera rotation inverts the order along $\mathbf{r}$, resulting in popping. (d) Camera translation does not alter the distance compared to (b).

Various strategies have arisen to tackle challenges in NeRFs, particularly mitigating the computational costs linked to multilayer perceptron (MLP) evaluation. These approaches include adopting direct voxel representations (Fridovich-Keil et al., 2022), employing feature hash maps (Müller et al., 2022), and exploring tensor factorizations (Chen et al., 2022; Tang et al., 2022)—departing to some extent from the original pure MLP design. A recent notable development in this trajectory is 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023), which renders oriented 3D Gaussians with spherical harmonics (SH) as a view-dependent color representation.

Remaining faithful to the traditional volume rendering equation, 3DGS facilitates gradient flows from image errors to the Gaussians’ positions, shapes, densities, and colors. With an initialization based on structure-from-motion (Snavely et al., 2006), a real-time compute-mode rasterizer, and heuristic-driven densification and sparsification, 3DGS converges to a high-quality representation with compact memory requirements. Consequently, 3DGS has firmly established itself as one of the most widely used methods for 3D scene reconstruction and differentiable rendering. Colored, semi-transparent 3D Gaussians serve as a versatile representation, but their accurate rendering is challenging. Although the projection of a 3D Gaussian onto a view ray is straightforward, leveraging synergies between neighboring rays under perspective projection proves intricate. Hence, 3DGS approximates them as flattened 2D splats (Zwicker et al., 2002), necessitating depth-based sorting for rendering. 3DGS further simplifies this step by sorting based on the view-space $z$-coordinate of each Gaussian’s mean, effectively projecting splats onto spherical shells reminiscent of Broxton et al. (2020). While this global sorting eases the rendering algorithm, it introduces popping artifacts, i.e., sudden color changes for consistent geometry, during camera rotations due to changes in the relative depth of shells (see Fig. 2). Such view inconsistencies due to popping can be very irritating and immersion-breaking, e.g., during head rotation in a virtual reality setting.

Fully evaluating all Gaussians in 3D along each view ray while considering their overlap would be ideal, but likely not feasible in real-time. The next best solution involves approximating the location where each Gaussian contributes the most for each view ray, i.e., determining its depth, followed by a correct per-pixel blending. Sorting must now happen for each view ray, rather than globally for all Gaussians; an obvious challenge as it is not uncommon to see thousands of Gaussians be considered for individual rays in 3DGS. To solve this challenge, we propose a novel 3D Gaussian Splatting rendering pipeline that exploits coherence among neighboring view rays on multiple hierarchy levels, interleaving culling, depth evaluation and resorting. We make the following contributions:

  • A novel hierarchical 3D Gaussian Splatting renderer that leads to per-pixel sorting of Gaussian splats for both the forward and backward pass of the 3DGS rendering pipeline and thus removes popping artifacts.

  • An in-depth analysis of culling and depth approximation strategies, as well as pipeline optimizations and workload distribution schemes for our compute-mode 3DGS hierarchical renderer.

  • A discussion and evaluation of various sorting strategies of Gaussian splats and their influence on overall rendering quality and view-consistency.

  • An effective automatic method to detect popping artifacts in videos captured from trained 3D Gaussians as well as a user study confirming the results of the presented method.

Our results indicate that a full per-pixel sorted renderer for Gaussian splats eliminates all popping artifacts but reduces rendering speed by $100\times$. Our hierarchical renderer is virtually indistinguishable from a full per-pixel sorted renderer, but only adds an overhead of 4% compared to the original 3DGS.

2. Preliminaries and Related Work

In the following, we review the renderer used in 3DGS. For a complete description of the approach, cf. Kerbl et al. (2023).

2.1. 3D Gaussian Splatting

NeRF-style rendering and 3DGS use the volume rendering equation:

$$C(\mathbf{r}) = \int_{0}^{t} \mathbf{c}(\mathbf{r}, t)\,\sigma(\mathbf{r}, t)\,T(\mathbf{r}, t)\,dt, \quad \text{where} \quad T(\mathbf{r}, t) = e^{-\int_{0}^{t} \sigma(\mathbf{r}, s)\,ds}, \tag{1}$$

$C(\mathbf{r})$ is the output color for a given ray $\mathbf{r}$, $\sigma(\mathbf{r}, t)$ is the opacity along the ray and $\mathbf{c}(\mathbf{r}, t)$ is the view-dependent emitted radiance. 3DGS represents a scene as a mixture of $N$ 3D Gaussians each given by:

$$G(\mathbf{x}) = e^{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})}, \quad \text{where} \quad \Sigma = R S S^{T} R^{T},$$

$\mathbf{\mu}$ is the Gaussian’s location, $R$ is a rotation matrix and $S$ is a diagonal scaling matrix, allowing us to position, rotate and non-uniformly scale Gaussians in 3D space while ensuring that $\Sigma$ is positive semi-definite. When evaluating a 3D Gaussian along a ray, the resulting projection is a 1D Gaussian. It seems natural to evaluate Eqn. (1) considering how multiple Gaussians influence any location along the ray. As there is no elementary indefinite integral known for Gaussians, numerical integration is likely the only option. In practice, this would require a strict sorting of all start and end points of all Gaussians and sampled numerical integration.

Instead, 3DGS makes multiple simplifications. First, they consider all Gaussians to be separated in space, i.e., compress their extent to a Dirac delta along the ray. Second, the Dirac delta of the $i$-th Gaussian is located at

$$t_i = \mathbf{\mu}_i^{T}\mathbf{v}, \tag{2}$$

i.e., the projection of the mean $\mathbf{\mu}_i$ onto the view direction $\mathbf{v}$, independent of the individual ray $\mathbf{r}$. Third, they approximate the projection of the Gaussian onto all rays, relying on an orthogonal projection approximation considering the first derivative of the 3D Gaussian to construct a 2D splat $G_2$ (Zwicker et al., 2002).

These approximations enable faster rendering: Eqn. (1) becomes

$$C(\mathbf{r}) = \sum_{i=1}^{N_{\mathbf{r}}} \mathbf{c}_i \alpha_i \prod_{j=1}^{i-1} (1-\alpha_j), \tag{3}$$

where $i$ iterates over the $N_{\mathbf{r}}$ Gaussians that influence the ray in the ordering of $t_i$, and $\alpha_i$ is the opacity of the Gaussian along the ray, i.e., $G_2(x, y)$, multiplied by a learned per-Gaussian opacity value.
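To make the role of the blend order concrete, Eqn. (3) can be sketched for a single ray as below. This is a minimal Python illustration under stated assumptions, not the CUDA rasterizer of 3DGS: the function name `blend_front_to_back` and the RGB tuple representation are hypothetical, and the inputs are assumed to be already sorted by depth.

```python
from typing import Sequence, Tuple

Color = Tuple[float, float, float]


def blend_front_to_back(colors: Sequence[Color], alphas: Sequence[float]) -> Color:
    """Evaluate Eqn. (3) for one ray; inputs must already be sorted front to back."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha_j) for all j < i
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        for k in range(3):
            out[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return (out[0], out[1], out[2])


red, blue = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(blend_front_to_back([red, blue], [0.5, 0.5]))  # red in front
print(blend_front_to_back([blue, red], [0.5, 0.5]))  # blue in front
```

Swapping the two half-transparent splats changes the result from (0.5, 0.0, 0.25) to (0.25, 0.0, 0.5), which is why an inconsistent sort order shows up as a visible color change.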

Because $t_i$ is independent of $\mathbf{r}$, a global sort of all $t$ is possible. Naïvely, this would lead to $N_{\mathbf{r}} = N$ for all rays. To reduce the number of Gaussians considered per ray, 3DGS splits the image into $16{\times}16$ pixel tiles, and runs a combined depth and tile sorting pre-pass, before evaluating Eqn. (3). For each tile and each Gaussian that may potentially contribute to any pixel in this tile (considering the 2D bounding box around the 1% Gaussian contribution threshold), a sorting key is generated with the tile index in the higher order bits and the depth in the lower bits. Sorting those combined keys leads to a $t_i$-sorted list for each tile.
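The combined key can be illustrated with a small Python sketch. The 32/32 bit split and the reinterpretation of a non-negative 32-bit float depth as an unsigned integer are assumptions made for this example, and the names `make_key` and `sort_per_tile` are hypothetical; the sketch only shows why one sort over such keys yields a depth-sorted list per tile.

```python
import struct
from typing import Dict, List, Tuple


def make_key(tile_id: int, depth: float) -> int:
    """Pack the tile index into the high bits and the depth into the low bits.

    Assumes depth >= 0: the bit pattern of a non-negative IEEE-754 float
    preserves ordering when read as an unsigned integer.
    """
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (tile_id << 32) | depth_bits


def sort_per_tile(instances: List[Tuple[int, float, int]]) -> Dict[int, List[int]]:
    """instances holds one (tile_id, depth, gaussian_id) triple per touched tile."""
    keyed = sorted((make_key(tile, depth), gid) for tile, depth, gid in instances)
    per_tile: Dict[int, List[int]] = {}
    for key, gid in keyed:
        per_tile.setdefault(key >> 32, []).append(gid)
    return per_tile


instances = [(1, 2.0, 10), (0, 5.0, 11), (1, 0.5, 12), (0, 1.0, 10)]
print(sort_per_tile(instances))
```

Gaussian 10 overlaps both tiles and therefore appears twice, once per tile, each time at the position given by its depth.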

2.2. Radiance Field Methods

Contrary to 3DGS, NeRFs (Mildenhall et al., 2020) require sampling a continuous, implicit neural scene representation densely. Therefore, real-time rendering as well as handling unbounded scenes proves difficult. Many follow-up works investigated NeRF extensions to handle unbounded scenes (Barron et al., 2021, 2022, 2023) as well as faster rendering (Müller et al., 2022; Fridovich-Keil et al., 2022; Chen et al., 2022), 3D scene editing (Nguyen-Phuoc et al., 2022; Kuang et al., 2023; Jambon et al., 2023), avatar generation (Zielonka et al., 2023), scene dynamics (Pumarola et al., 2020; Park et al., 2021) and 3D object generation (Jain et al., 2022; Poole et al., 2022; Raj et al., 2023).

2.3. 3DGS Follow-up Work

Following the code release and subsequent publication of 3DGS, several extensions have popped up investigating various paradigms, including the editing of trained Gaussians (Chen et al., 2023; Fang et al., 2023), text-to-3D (Tang et al., 2023; Yi et al., 2023) and 4D novel view synthesis (Luiten et al., 2024; Wu et al., 2023). Mip-Splatting (Yu et al., 2023) proposes a 3D smoothing filter and 2D Mip filter to remedy aliasing in 3DGS. Besides them, most approaches merely leverage Gaussians as graphics primitives, whereas our approach tackles current problems with 3DGS.

2.4. Software Rasterization

Our compute-mode rendering pipeline for 3DGS is related to other software-based rendering pipelines. Early works like Pomegranate (Eldridge et al., 2000) and the Larrabee project (Seiler et al., 2008) showed that software pipelines on custom hardware are viable for rendering. Special compute-mode rendering pipelines have been proposed for REYES (Zhou et al., 2009; Tzeng et al., 2010), triangle rasterization (Liu et al., 2010; Laine and Karras, 2011; Patney et al., 2015; Kenzel et al., 2018; Karis et al., 2021) and point clouds (Schütz et al., 2021). Similarly to these efforts, we show that taking into account the specifics of the rendering problem, a compute-mode renderer for sorted Gaussian splats can execute in real-time on modern GPUs.

2.5. Order Independent Transparency

Correctly and efficiently rendering semi-transparent primitives, such as Gaussian splats, proves intricate for rasterization-based renderers. Methods approximating order independent transparency (Wyman, 2016) investigate this paradigm. $k$-buffers (Callahan et al., 2005; Bavoil et al., 2007) operate with a fixed per-pixel memory budget, circumventing the large memory requirement of $A$-buffers (Carpenter, 1984). When this budget is exceeded, new incoming fragments are either merged (Salvi et al., 2011; Salvi and Vaidyanathan, 2014) or the closest fragment gets written to the color buffer (Callahan et al., 2005); both cases require a nearly-sorted order for incoming fragments. Our work combines hierarchical levels of $k$-buffers with 3DGS’s tile-based rasterization.

3. Real-time Sorted Gaussian Splatting

We present a novel per-pixel sorted 3D Gaussian splatting approach, departing from the current global sorting paradigm. Utilizing fast per-pixel depth calculations and a hierarchical intra-tile cooperative sorting approach, our method enhances the accuracy of the resulting sort order. To streamline computations, we incorporate per-tile opacity culling and a fast and GPU-friendly load balancing scheme.

Figure 3. Our approach to compute $t_{opt}$ avoids popping by placing splats at the point of maximum contribution along the view ray $\mathbf{r}$, creating sort orders independent of camera rotation (red view vector). Note that the shape of $t_{opt}$ is a curved surface and changes with the camera position; cf. Fig. 2.

3.1. Global Sorting

[Figure 4 panels. Top row (full sorted): 3DGS trained, Ours trained, 3DGS trained, Ours trained. Below, for two views, rows "3DGS rendered" and "Sort Error" (color scale 0 to 10.0) with columns: full sort, 3DGS, resort 4, resort 24, Ours.]

Figure 4. Correct rendering of a trained 3DGS scene with per-pixel sorting reveals how 3DGS cheats with the location of Gaussians. Our approach, on the other hand, considers correct sorting during training and rendering. Below, we show the sort error of different resorting windows and our full approach, cf. Tab. 1. We intentionally use the trained 3DGS model here, as our trained version does not show these kinds of artifacts for visualization. The error visualization captures the sum over the depth difference of all wrongly sorted neighbors. For resorting with a window size of 4, tile artifacts are still visible. Our approach hardly diverges from fully sorted rendering, while running $100\times$ faster; it is also about $5\times$ faster than resort 24 and on average only 4% slower than 3DGS.

3DGS (Kerbl et al., 2023) performs a global sort based on the view-space $z$-coordinate of each Gaussian’s mean $\mathbf{\mu}$, see Eqn. (2). This leads to a consistent sort order during translation, but not during rotation, as illustrated in Fig. 2. While 3DGS may use this fact during training to introduce differences between views (and thus reduce the loss), it is in general undesirable, as camera rotations can lead to popping artifacts, which are particularly disturbing when inspecting the optimized 3D scene. Our objective is to stabilize color computations under rotation by splatting Gaussians based on the point of highest contribution along each view ray. Note that, although we improve rendering consistency, we still approximate true 3D Gaussians, neglecting any overlap between them.
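The behavior under rotation and translation can be reproduced with a small 2D Python sketch of Eqn. (2). The camera model (an origin plus a yaw angle defining $\mathbf{v} = (\sin\varphi, \cos\varphi)$), the function names and the two example means are assumptions chosen for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]


def view_depth(mu: Vec2, origin: Vec2, yaw: float) -> float:
    """Depth of a mean along the view direction v = (sin yaw, cos yaw), cf. Eqn. (2)."""
    v = (math.sin(yaw), math.cos(yaw))
    return (mu[0] - origin[0]) * v[0] + (mu[1] - origin[1]) * v[1]


def global_order(means: Sequence[Vec2], origin: Vec2, yaw: float) -> List[int]:
    """Indices of the Gaussians sorted by view-space depth."""
    return sorted(range(len(means)), key=lambda i: view_depth(means[i], origin, yaw))


means = [(1.0, 2.0), (-1.0, 2.1)]
print(global_order(means, (0.0, 0.0), 0.0))   # reference view
print(global_order(means, (0.0, 0.0), 0.2))   # rotated camera, same position
print(global_order(means, (0.5, -1.0), 0.0))  # translated camera, same rotation
```

In this example, the rotated camera swaps the order of the two Gaussians although neither the scene nor the camera position changed, while the translated camera keeps the reference order because the depth difference $(\mathbf{\mu}_1 - \mathbf{\mu}_2)^{T}\mathbf{v}$ does not depend on the camera origin.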

3.2. Per-pixel Depth and Naïve Sorting

When replacing a 1D Gaussian along the view ray with a Dirac impulse, the mean/maximum of this 1D Gaussian is arguably the best discrete blend location. This maximum, $t_{opt}$, can be computed from the derivative of the 3D Gaussian along the view ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$:

$$t_{opt} = \frac{\mathbf{d}^{T}\Sigma^{-1}(\mathbf{\mu}-\mathbf{o})}{\mathbf{d}^{T}\Sigma^{-1}\mathbf{d}}. \tag{4}$$

Please see Appendix A for the step-by-step derivation.

Consider a simple 2D case with an isotropic Gaussian $\Sigma^{-1} = \mathbf{I}$, the camera at $(0, 0)$ and the Gaussian at $\mathbf{\mu} = (0, \mu_y)$. It is easy to see that the depth function follows a cosine as $\mathbf{d}$ is normalized:

$$t = \frac{\mathbf{d}^{T}\mathbf{I}\mathbf{\mu}}{\mathbf{d}^{T}\mathbf{I}\mathbf{d}} = d_y \cdot \mu_y = \cos(\theta)\,\mu_y,$$

where $\theta$ is the angle of the view ray. Thus, there is no simple primitive, such as a plane, that represents $t_{opt}$ and could be rasterized traditionally (see Fig. 3). Therefore, we compute $t_{opt}$ on a per-ray basis.
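A per-ray evaluation of Eqn. (4) can be sketched in a few lines of Python. The helper names (`dot`, `mat_vec`, `t_opt`) are hypothetical, and the sketch assumes that $\Sigma^{-1}$ is passed as a symmetric matrix given as nested lists; it reproduces the isotropic cosine example from the text.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]
Mat = Sequence[Sequence[float]]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def mat_vec(m: Mat, v: Vec) -> List[float]:
    return [dot(row, v) for row in m]


def t_opt(origin: Vec, direction: Vec, mu: Vec, sigma_inv: Mat) -> float:
    """Ray parameter of maximum Gaussian contribution, Eqn. (4).

    Assumes sigma_inv is symmetric, so d^T Sigma^{-1} equals (Sigma^{-1} d)^T.
    """
    sd = mat_vec(sigma_inv, direction)
    diff = [m - o for m, o in zip(mu, origin)]
    return dot(sd, diff) / dot(sd, direction)


# Isotropic 2D example from the text: camera at (0, 0), Gaussian at (0, mu_y).
theta = math.pi / 3
d = (math.sin(theta), math.cos(theta))
identity = [[1.0, 0.0], [0.0, 1.0]]
print(t_opt((0.0, 0.0), d, (0.0, 2.0), identity), math.cos(theta) * 2.0)
```

Both printed values agree up to floating-point rounding, matching $t = \cos(\theta)\,\mu_y$ for this configuration.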

When reconstructing surfaces, Gaussians often turn very flat; as such, $\Sigma^{-1}$ may become large and lead to instabilities in the computation. Bounding the entries of $S^{-1}$ to $10^{3}$ removes those instabilities in our experiments, by effectively thickening very thin Gaussians, with minimal impact on the computed depth.

With the computation of $t_{opt}$ in place, we can eliminate all popping artifacts and ensure perfect view-consistency by sorting all Gaussians per ray by their $t_{opt}$ value. Unfortunately, even the simplest 3DGS reconstructions consist of tens of thousands of Gaussians, often leading to thousands of potentially contributing Gaussians per view ray. Furthermore, early ray termination cannot be performed before sorting, as it is dependent on the sort order. Even an optimized parallel per-ray sort on top of the original 3DGS tile-based rasterizer leads to slowdowns of more than $100\times$, not only making the approach impractical for real-time rendering, but also impeding optimization.

3.3. Per-tile Sorting and Local Resorting

[Figure 5 panels: (a) w/o per-tile depth; (b) w/ naïve per-tile depth.]

Figure 5. Comparison of 3DGS with and without per-tile depth calculation. Per-tile depth calculation lowers sorting errors ($\delta_{max} = 4.01$, $\delta_{avg} = 0.284$ compared to $\delta_{max} = 5.43$, $\delta_{avg} = 0.898$). However, doing this without additional per-pixel sorting leads to artifacts at the tile borders.

Although it is not possible to describe $t_{opt}$ with a simple primitive for rasterization, we may still rely on the fact that $t_{opt}$ is smooth across neighboring rays. As such, the sorting order of neighboring rays should also be similar. Because sorting in 3DGS already happens with a combined tile/depth key, we could replace the global depth with an accurate per-tile depth value for each Gaussian, e.g., using the tile center ray for Eqn. (4). As can be seen in Fig. 5, using per-tile depth clearly leads to artifacts along the tile borders.

With that in mind, we propose a simple per-ray resorting extension. Instead of immediately blending the next Gaussian when walking through the tile list, we keep a small resorting window in registers. When loading a Gaussian, we evaluate its $t_{opt}$ and use insertion sort to place it in the resorting window. If the window overflows, we blend the sample with the smallest depth. This simple method follows the idea of $k$-buffers (Callahan et al., 2005; Bavoil et al., 2007) without fragment merging, which requires the Gaussians along a ray to be nearly-sorted. Although this sorting strategy is easy to implement, it already achieves good results for a resorting window of about 16 to 24, removing the majority of visible popping artifacts in our tested scenes. To confirm the improvement in blending order, we compute a per-ray sort error $\delta$: If two consecutive Gaussians are out of order, we accumulate their difference in $t_{opt}$. We present a visual example in Fig. 4, with corresponding runtimes and $\delta$ in Tab. 1; evidently, even though $\delta$ decreases with a larger resorting window, there is a non-negligible increase in runtime.
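The resorting window and the sort error $\delta$ can be sketched on the CPU as follows. This is a simplified Python illustration, not the register-based GPU implementation: the names `resort_stream` and `sort_error` are hypothetical, only depths are tracked (no colors or opacities), and `bisect.insort` stands in for the insertion step of insertion sort.

```python
import bisect
from typing import Iterable, List


def resort_stream(depths: Iterable[float], window: int) -> List[float]:
    """Return the blend order produced by a sorted window of fixed capacity."""
    buf: List[float] = []
    out: List[float] = []
    for depth in depths:
        bisect.insort(buf, depth)   # place the new sample in the sorted window
        if len(buf) > window:
            out.append(buf.pop(0))  # overflow: blend the sample with the smallest depth
    out.extend(buf)                 # drain the window at the end of the tile list
    return out


def sort_error(blend_order: List[float]) -> float:
    """Accumulated depth difference of consecutive out-of-order samples."""
    return sum(a - b for a, b in zip(blend_order, blend_order[1:]) if a > b)


nearly_sorted = [1.0, 3.0, 2.0, 0.5, 4.0]
for window in (0, 2, 3):
    order = resort_stream(nearly_sorted, window)
    print(window, order, sort_error(order))
```

For this input, a window of 0 leaves the list unchanged with an error of 2.5, a window of 2 reduces the error to 0.5, and a window of 3 yields a fully sorted order. This mirrors the trade-off in the text: a larger window repairs larger displacements but costs more work per sample.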

[Figure 6 panels: (a) w/o tile-based culling; (b) w/ tile-based culling. Color scale: 0 to 5k.]

Figure 6. Number of Gaussians per tile with and without tile-based culling for the Mip-NeRF 360 Garden scene. The average number of Gaussians per tile is reduced by $\sim 44\%$.
Table 1. Maximum sort error over all pixels and average sort error for two representative example views from Fig. 4. A full sort per ray increases rendering times (relative to 3DGS) by more than $100\times$. Local resorting with a sort window of 16 to 24 removes the majority of visible popping artifacts, yet increases rendering time 2 to $6\times$. Our hierarchical approach improves sort quality further and keeps processing time low. Note that a larger sorting window may lead to more Gaussians being fetched and thus our measurement of $\delta_{\text{max}}$ may increase with larger sort windows.

| Scene | Metric | 3DGS | Full | Window 4 | Window 8 | Window 16 | Window 24 | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Train | $\delta_{\text{max}}$ | 28.445 | 0.000 | 5.867 | 3.882 | 3.544 | 4.580 | 0.575 |
| Train | $\delta_{\text{avg}}$ | 3.688 | 0.000 | 0.124 | 0.045 | 0.014 | 0.007 | 0.003 |
| Train | time [ms] | 1.00 | 142.03 | 1.21 | 1.66 | 2.70 | 4.22 | 0.92 |
| Bonsai | $\delta_{\text{max}}$ | 33.543 | 0.000 | 12.786 | 8.954 | 6.391 | 5.595 | 3.098 |
| Bonsai | $\delta_{\text{avg}}$ | 3.786 | 0.000 | 0.265 | 0.110 | 0.039 | 0.019 | 0.006 |
| Bonsai | time [ms] | 1.00 | 179.70 | 1.76 | 2.58 | 4.33 | 6.88 | 1.47 |

3.4. Hierarchical Rendering

Local resorting is already able to significantly improve the per-pixel sort order, which greatly reduces popping artifacts. To tackle the imposed performance overhead, we insert additional resorting levels between tiles and individual threads, creating a sort hierarchy. In this way, we can share sorting efforts between neighboring rays, while incrementally refining the sort order as we move towards individual rays. By additionally culling non-contributing Gaussians at every level of the hierarchy, we can drastically reduce sorting costs. We propose a hierarchical rendering pipeline that relies on the innate memory and execution hierarchy of the GPU to minimize the number of memory access operations, as outlined in Fig. 7. For a fair comparison, we intentionally only alter the blend order of Gaussians and leave the other parts of 3DGS untouched, including the 2D splatting approximation from Zwicker et al. (2002).

Figure 7. Overview of the detailed steps in our pipeline. We add load balancing, tile culling and per-tile depth evaluation to the first two stages of 3DGS. Our hierarchical rasterizer utilizes three sorted queues, going from $4{\times}4$ tiles over $2{\times}2$ tiles to individual rays. The queues store only id and the tile’s $t_{opt}$ per Gaussian, while additional information is re-fetched from global memory on demand, and shared between threads via shuffle operations. Depending on the queue fill levels, we switch between different cooperative group sizes while ensuring the queues remain filled for effective sorting. Our pipeline achieves an overall sorting window of 25-72 elements.
Tile-based culling

We propose a fast tile-based culling approach that bounds Gaussians to exactly those tiles they contribute to. For each ray, Kerbl et al. (2023) disregard Gaussians with a contribution below $\epsilon_O = 1/255$, which forms an exact culling condition. Like 3DGS, we start with an axis-aligned bounding rectangle using the largest eigenvalue of the 2D covariance matrix to determine which tiles may potentially be touched during both Preprocess and Duplication. This conservative estimate gives very large bounds for highly anisotropic Gaussians.

For exact culling, we then calculate the point $\hat{\mathbf{x}}$ inside each tile $X$ that maximizes the 2D Gaussian’s contribution $G_2(\mathbf{x})$:

$$\hat{\mathbf{x}} = \operatorname*{arg\,max}_{\mathbf{x} \in X} G_2(\mathbf{x}) = \operatorname*{arg\,min}_{\mathbf{x} \in X} (\mathbf{x} - \mu_2)^T \Sigma_2^{-1} (\mathbf{x} - \mu_2). \tag{5}$$

If $\mu_2 \in X$, then $\hat{\mathbf{x}} = \mu_2$. If $\mu_2 \notin X$, then $\hat{\mathbf{x}}$ must lie on one of the two tile edges closest to $\mu_2$, due to Gaussians being monotonic along rays pointing away from $\mu_2$. We can then compute the maximum along those two edges (similar to Eqn. (4), but in 2D) and clamp the resulting values to obtain $\hat{\mathbf{x}}$ (see Appendix B.1 for the full algorithm). Finally, we evaluate $G_2(\hat{\mathbf{x}})$ to perform the comparison with $\epsilon_O$, which significantly reduces the number of Gaussians per tile (cf. Fig. 6).
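The case analysis above can be sketched in a few lines of Python. This is a simplified, sequential illustration and not the GPU routine from Appendix B.1. The function names, the packing of $\Sigma_2^{-1}$ as a tuple `(a, b, c)`, and the multiplication by the Gaussian's opacity before the threshold test are assumptions made for this example.

```python
import math

def mahalanobis2(p, mu, conic):
    """Squared Mahalanobis distance for the inverse covariance [[a, b], [b, c]]."""
    a, b, c = conic
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    return a * dx * dx + 2.0 * b * dx * dy + c * dy * dy

def max_contribution_point(mu, conic, tile_min, tile_max):
    """Point of the axis-aligned tile where the 2D Gaussian is largest (Eq. 5)."""
    a, b, c = conic
    (x0, y0), (x1, y1) = tile_min, tile_max
    mx, my = mu
    if x0 <= mx <= x1 and y0 <= my <= y1:
        return (mx, my)  # the mean itself lies inside the tile

    def clamp(v, lo, hi):
        return max(lo, min(v, hi))

    candidates = []
    if mx < x0 or mx > x1:
        # Vertical edge facing the mean: minimize the quadratic in y, then clamp.
        ex = x0 if mx < x0 else x1
        candidates.append((ex, clamp(my - b * (ex - mx) / c, y0, y1)))
    if my < y0 or my > y1:
        # Horizontal edge facing the mean: minimize the quadratic in x, then clamp.
        ey = y0 if my < y0 else y1
        candidates.append((clamp(mx - b * (ey - my) / a, x0, x1), ey))
    return min(candidates, key=lambda p: mahalanobis2(p, mu, conic))

def tile_contributes(mu, conic, opacity, tile_min, tile_max, eps=1.0 / 255.0):
    """Keep the Gaussian for this tile only if its best pixel can reach eps.

    Assumption: the contribution is opacity * exp(-0.5 * squared distance).
    """
    best = max_contribution_point(mu, conic, tile_min, tile_max)
    return opacity * math.exp(-0.5 * mahalanobis2(best, mu, conic)) >= eps
```

Because the quadratic form is convex, clamping the one-dimensional minimizer to the extent of an edge yields the minimum along that edge, so at most two candidates need to be compared.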

Tile-depth Adjustment

For pre-sorting, we require a representative $t_{opt}$ per tile. Intuitively, the center ray of the tile should be a valid compromise for all rays in the tile. However, this completely ignores the fact that a Gaussian in general does not uniformly contribute to all rays in a tile. Especially for small Gaussians whose main extent is approximately parallel to the view rays, the center ray may result in depth estimates far away from any contribution made by the Gaussian.

Arguably, the weighted integral $\int_X G_2(\mathbf{x})\, t_{opt}(\mathbf{x})\, d\mathbf{x}$ is a better estimate. Yet, even a numerical approximation considering all rays in the tile $X$ is too compute-intensive. Thus, we approximate it with a single sample: the one with the highest weight within a tile, i.e., $\hat{\mathbf{x}}$. Since $\hat{\mathbf{x}}$ was already calculated during culling, we only need to construct the corresponding ray to evaluate $t_{opt}$. The optimized depth location reduces $(\delta_{\text{max}}, \delta_{\text{avg}})$ from $(1.553, 0.006)$ to $(0.575, 0.003)$ and from $(3.917, 0.014)$ to $(3.098, 0.006)$ for the views in Tab. 1.

Load Balancing

Similar to other compute-mode rasterization methods, primitives that cover a large portion of the screen may become an issue if a single thread evaluates their coverage. For 3DGS, this is the case in the first two stages of the rendering pipeline, which operate on a per-Gaussian basis. For our method, tile-culling and per-tile depth calculations increase the workload of these stages, which further amplifies this problem.

To remedy this issue, we propose a two-stage load balancing scheme. In the first phase, each thread responsible for a Gaussian that covers fewer than a predetermined maximum number of tiles performs its own processing. We empirically determined that a maximum of 32 tiles results in good performance. Most threads are typically idle after this initial phase. In the second phase, we distribute the remaining workload within each warp using warp voting and shuffle instructions. For close-ups and high-resolution rendering, where single Gaussians often cover a large portion of the screen, our approach can speed up Preprocess and Duplication by up to $10\times$.
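The benefit of the two-phase scheme can be illustrated with a toy cost model. The sketch below counts sequential tile evaluations for one warp and is not the CUDA implementation. In particular, it assumes that in the second phase all 32 lanes walk over the tiles of each remaining Gaussian together, which is only one plausible reading of the warp-level redistribution. The name `warp_steps` is hypothetical.

```python
import math

WARP_SIZE = 32

def warp_steps(tile_counts, max_tiles_per_thread=32):
    """Return (naive, balanced) sequential tile evaluations for one warp.

    tile_counts[i] is the number of tiles covered by the Gaussian of lane i.
    """
    assert len(tile_counts) <= WARP_SIZE
    # Naive: every lane handles its own Gaussian, the slowest lane dominates.
    naive = max(tile_counts, default=0)
    # Phase 1: lanes with fewer tiles than the threshold work on their own.
    small = [n for n in tile_counts if n < max_tiles_per_thread]
    phase1 = max(small, default=0)
    # Phase 2 (assumed model): the whole warp processes each large Gaussian,
    # one batch of WARP_SIZE tiles per step.
    large = [n for n in tile_counts if n >= max_tiles_per_thread]
    phase2 = sum(math.ceil(n / WARP_SIZE) for n in large)
    return naive, phase1 + phase2
```

For a warp with 31 small Gaussians covering 4 tiles each and one Gaussian covering 1024 tiles, the model gives 1024 steps without balancing and 36 with it.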

Hierarchically Sorted Rendering

With the goal of establishing a hierarchical rendering pipeline, a naïve approach is to design one kernel per hierarchy level. However, such an approach would require communication via slow global memory between the levels and would prohibit early ray termination after reaching the opacity threshold. Thus, we opt for combining the final three levels of our rendering hierarchy in a single kernel, where multiple threads cooperatively sort and manage shared queues, as detailed in Fig. 7. We use a large $4{\times}4$ tile-queue of 64 elements (managed by 16 threads), feeding into four eight-element $2{\times}2$ tile-queues. Finally, each $2{\times}2$ tile-queue feeds into four per-pixel queues with four elements, managed by one thread each. For one $16{\times}16$ tile, we thus start 256 threads, allocate 16 $4{\times}4$ tile-queues and 64 $2{\times}2$ tile-queues in shared memory as well as one per-pixel queue per thread in registers. Each queue stores only the Gaussian’s id and the current level’s depth $t_{opt}$. Additional information is loaded on demand from global memory and shared between threads of the hierarchical level via shuffle operations, e.g. $\mu, \Sigma^{-1}$ during depth calculation, or $\mu_2, \Sigma_2^{-1}$ during culling and blending.

The queues follow a push methodology to keep queue fill rates as high as possible, ensuring that resorting remains effective. While 16 threads (a halfwarp) are assigned to each $4{\times}4$ tile-queue, we load and feed batches of 32 into two $4{\times}4$ tile-queues at once, allowing all threads within a warp to load data together. After loading, each thread performs tile-based culling (as described before, but for a $4{\times}4$ tile), followed by computing $t_{opt}$. For culled Gaussians, we set $t_{opt} = \infty$. Then, each halfwarp sorts the 32 newly loaded elements using Batcher Merge Sort (Batcher, 1968) before writing them to the back of the $4{\times}4$ tile-queue. Typically, there are now two individually sorted parts in the $4{\times}4$ tile-queue: the already present elements (up to 32) and the newly added (up to 32). As both are sorted, we use efficient merge sort to combine them. Culled Gaussians are now at the back of the queue and can be discarded.
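The push step for one tile-queue can be mimicked sequentially as follows. This is a minimal sketch under stated assumptions: Python's built-in sort and `heapq.merge` stand in for the cooperative Batcher sort and the warp-level merge, and the names `push_batch`, `is_culled` and `depth_of` are hypothetical helpers introduced only for this example.

```python
import heapq
import math

def push_batch(queue, batch, is_culled, depth_of):
    """Merge a freshly loaded batch into an already sorted tile-queue.

    queue: list of (depth, gaussian_id) tuples sorted by depth.
    batch: iterable of gaussian ids that were just loaded.
    """
    # Culled Gaussians receive an infinite depth so they sort to the back.
    keyed = [(math.inf if is_culled(g) else depth_of(g), g) for g in batch]
    # Stand-in for the cooperative sort of the new batch.
    keyed.sort(key=lambda entry: entry[0])
    # Both parts are sorted, so a single merge pass combines them.
    merged = list(heapq.merge(queue, keyed, key=lambda entry: entry[0]))
    # Discard culled entries, which now sit at the back of the queue.
    while merged and merged[-1][0] == math.inf:
        merged.pop()
    return merged
```

Using an infinite depth as the cull marker means no separate compaction pass is required: sorting and merging move culled entries to the end, where they are simply dropped.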

While there are more than 32 elements in the $4{\times}4$ tile-queue, we push batches of size 16 into the $2{\times}2$ tile-queue. Each thread in the halfwarp re-fetches the data needed for computing $t_{opt}$ for a single Gaussian. Each group of four threads then pushes sub-batches of size four into their $2{\times}2$ tile-queue, relying on shuffle instructions to update $t_{opt}$ for each $2{\times}2$ tile. We follow the same approach as before: we sort the four new entries according to depth, for which we use a simple coordination using shuffle instructions. We then use merge sort to combine the new elements with the existing ones.

After the $2{\times}2$ tile-queue is filled, we draw four elements from it and insert them into the per-pixel queue. Again, we batch-load the needed data using the four threads assigned to the respective $2{\times}2$ tile-queue, and again use shuffle instructions to communicate all relevant information for each Gaussian to all other threads in the $2{\times}2$ tile. We evaluate $t_{opt}$ and $\alpha$ for the respective rays and insert the newly computed data into the per-pixel queue. If the Gaussian’s $\alpha$ is below $\epsilon_O$, we simply discard it. As we add elements one by one into the per-pixel queue, we rely on simple insertion sort. Only when the per-pixel queue is full do we take one element from it and perform blending, freeing up space for the next element from the $2{\times}2$ tile-queue.

Due to the hierarchical structure, we effectively construct an overall sort window varying between 25 and 72 elements. The minimum is hit if the $4{\times}4$ tile-queue is drained down to 17 elements, with 4 elements remaining in each of the other two queues. The maximum of 72 elements is reached if we fill the $4{\times}4$ tile-queue with 64 elements and then move 4 elements through the half-filled $2{\times}2$ tile-queue and the filled per-pixel queue. While our sort setup typically achieves better sorting than a simple per-thread sort window of 25, we may occasionally achieve worse sorting, as the higher-level queues are shared between threads.
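The notion of a sort window can be made concrete with a single-queue model. The sketch below is a deliberately simplified, per-ray stand-in for the three-level hierarchy: one bounded buffer receives an approximately sorted depth stream and releases its smallest element only when it is full. The function name `resort_with_window` is an assumption of this example. The two constants restate the window bounds quoted in the text.

```python
import heapq

# Window bounds from the text: drained 4x4 queue plus the two smaller queues,
# and a full 4x4 queue plus four elements in each of the two smaller queues.
MIN_WINDOW = 17 + 4 + 4
MAX_WINDOW = 64 + 4 + 4

def resort_with_window(depths, window):
    """Re-sort a stream with a bounded buffer holding at most `window` elements."""
    if window < 1:
        raise ValueError("window must be at least 1")
    buffer, ordered = [], []
    for depth in depths:
        if len(buffer) == window:
            # Buffer full: emit (blend) the closest element to free one slot.
            ordered.append(heapq.heappop(buffer))
        heapq.heappush(buffer, depth)
    # Drain the remaining elements in sorted order.
    while buffer:
        ordered.append(heapq.heappop(buffer))
    return ordered
```

An element that arrives more than `window` positions too late can no longer be placed correctly, which is why a larger effective window tolerates a coarser pre-sort.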

The sizes of the three queues are variable, with some restrictions. The $4{\times}4$ tile-queue size is constrained to $32n + 32$, with $n \in \mathbb{Z}^{+}$, as this enables efficient warp-wide merge sort. Similarly, the $2{\times}2$ tile-queue must be of size $4m + 4$, with $m \in \mathbb{Z}^{+}$, as it is managed by four threads. The per-pixel queue size can be chosen arbitrarily. We heuristically decided on $(64/8/4)$ for the three queue sizes, as this achieves a large enough sort window, while limiting shared memory requirements and register pressure, ultimately leading to better performance. We provide ablations for our chosen queue sizes and load balancing thresholds in Appendix E.

3.5. Backward Pass

Contrary to 3DGS, we perform gradient computations in front-to-back blending order, avoiding the large memory overhead required for storing per-pixel sorted Gaussians—which would be needed to restore the correct blending order.

Gradient computation in 3DGS, independent of direction, requires the final accumulated transmittance $T_{N_{\mathbf{r}}} = \prod_{j=1}^{N_{\mathbf{r}}} (1 - \alpha_j)$ and the final per-pixel color $C(\mathbf{r})$. To compute gradients for the $i$-th Gaussian along a view ray, we require the contribution of all subsequently blended Gaussians. Crucially, rather than accumulating the contribution of subsequent Gaussians back-to-front, we use subtraction and division, i.e.

$$\sum_{j=i+1}^{N_{\mathbf{r}}} \mathbf{c}_j \alpha_j \prod_{k=1}^{j-1} (1 - \alpha_k) = C(\mathbf{r}) - C_i, \tag{6}$$

$$\prod_{k=i}^{N_{\mathbf{r}}} (1 - \alpha_k) = \frac{T_{N_{\mathbf{r}}}}{T_i}, \tag{7}$$

where $C_i$ is the accumulated output color including the $i$-th Gaussian in front-to-back order. As we perform the same rendering routine as in the forward pass, including early stopping, the backward pass is equally efficient. Note that this does not change the stability of the gradient computations; 3DGS also relies on a division. Arguably, our approach may even lead to more accurate gradients: the Gaussians blended first along a ray contribute most to the final color, and computing those first accumulates fewer floating-point errors than reversing the computation.
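To make the front-to-back bookkeeping concrete, the following sketch computes the derivative of a single-channel pixel value with respect to each $\alpha_i$ using only quantities available in blending order. It is an illustration under stated assumptions and not the CUDA backward kernel: colors are scalars, a scalar background term is added, $T_i$ is taken as the transmittance in front of the $i$-th Gaussian, all $\alpha_i$ are strictly below 1, and the function names are hypothetical.

```python
def blend(colors, alphas):
    """Front-to-back alpha blending; returns (color, final transmittance)."""
    color, trans = 0.0, 1.0
    for c, a in zip(colors, alphas):
        color += c * a * trans
        trans *= 1.0 - a
    return color, trans

def alpha_gradients_front_to_back(colors, alphas, background=0.0):
    """d(pixel)/d(alpha_i) for pixel = blended color + background * T_N."""
    final_color, final_trans = blend(colors, alphas)  # kept from the forward pass
    grads = []
    acc_color, trans_before = 0.0, 1.0
    for c, a in zip(colors, alphas):
        acc_color += c * a * trans_before   # C_i, including the i-th Gaussian
        behind = final_color - acc_color    # Eq. (6): everything blended after i
        # Raising alpha_i adds c * T_i and attenuates everything behind it.
        grads.append(c * trans_before
                     - (behind + background * final_trans) / (1.0 - a))
        trans_before *= 1.0 - a
    return grads

def remaining_transmittance(alphas):
    """Eq. (7): product of (1 - alpha_k) for k >= i, obtained by division."""
    _, final_trans = blend([0.0] * len(alphas), alphas)
    result, trans_before = [], 1.0
    for a in alphas:
        result.append(final_trans / trans_before)
        trans_before *= 1.0 - a
    return result
```

Since the pixel value is linear in each individual $\alpha_i$, the result can be checked against central finite differences, which is a convenient sanity test when experimenting with the blending order.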

It is imperative that the same exact sort order is used between forward and backward pass to ensure correct gradients. Like 3DGS, we keep the global sort order in memory, which ensures that potentially equal depth values do not lead to different sorting results. In our implementation, we use stable sorting routines throughout: Batcher Merge Sort (Batcher, 1968) is stable by design, our merge sort routines rely on each thread’s rank to establish sort orders among equal depths, and our insertion sort is trivially stable.

Table 2. Image metrics for our method, 3DGS and related work. Results marked with a dagger ($\dagger$) are reproduced from Kerbl et al. (2023) to facilitate cross-method comparisons. Our quality is comparable to 3DGS. With Opacity Decay, our approach loses slightly less quality than 3DGS.

| Dataset | Deep Blending | | | | Mip-NeRF 360 Indoor | | | | Mip-NeRF 360 Outdoor | | | | Tanks & Temples | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metric | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ | ꟻLIP$\downarrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ | ꟻLIP$\downarrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ | ꟻLIP$\downarrow$ | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ | ꟻLIP$\downarrow$ |
| Mip-NeRF 360$\dagger$ | 29.40 | 0.900 | 0.245 | 0.138 | 31.57 | 0.914 | 0.182 | 0.088 | 24.42 | 0.691 | 0.286 | 0.170 | 22.22 | 0.758 | 0.256 | 0.232 |
| Instant-NGP (base)$\dagger$ | 23.62 | 0.797 | 0.423 | 0.258 | 28.65 | 0.840 | 0.281 | 0.120 | 22.63 | 0.536 | 0.444 | 0.203 | 21.72 | 0.723 | 0.330 | 0.245 |
| Instant-NGP (big)$\dagger$ | 24.96 | 0.817 | 0.390 | 0.222 | 29.14 | 0.863 | 0.241 | 0.114 | 22.75 | 0.567 | 0.403 | 0.200 | 21.92 | 0.745 | 0.304 | 0.241 |
| Plenoxels$\dagger$ | 23.09 | 0.794 | 0.425 | 0.244 | 24.84 | 0.765 | 0.366 | 0.182 | 21.69 | 0.513 | 0.467 | 0.229 | 21.09 | 0.719 | 0.344 | 0.262 |
| 3DGS | 29.46 | 0.900 | 0.247 | 0.131 | 30.98 | 0.922 | 0.189 | 0.094 | 24.59 | 0.727 | 0.240 | 0.167 | 23.71 | 0.845 | 0.178 | 0.199 |
| Ours | 29.86 | 0.904 | 0.234 | 0.127 | 30.62 | 0.921 | 0.186 | 0.099 | 24.60 | 0.728 | 0.235 | 0.167 | 23.21 | 0.843 | 0.173 | 0.216 |
| 3DGS (Opacity Decay) | 28.94 | 0.894 | 0.262 | 0.134 | 30.57 | 0.918 | 0.198 | 0.097 | 24.45 | 0.718 | 0.261 | 0.169 | 23.52 | 0.839 | 0.194 | 0.205 |
| Ours (Opacity Decay) | 29.84 | 0.905 | 0.241 | 0.126 | 30.03 | 0.917 | 0.194 | 0.103 | 24.46 | 0.722 | 0.254 | 0.169 | 23.18 | 0.839 | 0.184 | 0.214 |

4. Evaluation

For evaluation, we follow Kerbl et al. (2023) and use 13 real-world scenes from Mip-NeRF 360 (Barron et al., 2022), Deep Blending (Hedman et al., 2018) and Tanks & Temples (Knapitsch et al., 2017).

Opacity Decay

A viable approach to reduce the total number of Gaussians after optimization is replacing 3DGS’s opacity reset with a standard Opacity Decay during training. Every 50 iterations, we multiply each Gaussian’s opacity with a constant $\epsilon_{\text{decay}} = 0.9995$. We find that this modification results in significantly fewer, but larger Gaussians, potentially causing exacerbated popping.
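The decay schedule amounts to a periodic multiplicative update. The sketch below shows it on a plain list of activated opacities. This is an assumption for illustration: an actual training loop would apply the factor to however opacity is parameterized, and the function name `apply_opacity_decay` is hypothetical.

```python
def apply_opacity_decay(opacities, iteration, interval=50, decay=0.9995):
    """Multiply all opacities by `decay` on every `interval`-th iteration."""
    if iteration > 0 and iteration % interval == 0:
        return [o * decay for o in opacities]
    return list(opacities)

# Usage: two decay steps happen within the first 100 iterations.
opacities = [0.8, 0.1]
for it in range(1, 101):
    opacities = apply_opacity_decay(opacities, it)
```

Gaussians whose opacity is not continually reinforced by the optimization drift towards the pruning threshold, which is one way to read the reported reduction in the number of Gaussians.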

4.1. Quantitative Evaluation

Image Metrics

For our quantitative evaluation, we report PSNR, SSIM, LPIPS (Zhang et al., 2018) and ꟻLIP (Andersson et al., 2020) in Tab. 2. To facilitate cross-method comparisons, we reproduce the results from Kerbl et al. (2023) for Mip-NeRF 360 (Barron et al., 2022), Instant-NGP (Müller et al., 2022) and Plenoxels (Fridovich-Keil et al., 2022). For Deep Blending and Mip-NeRF 360 Outdoor, we outperform 3DGS. For Tanks & Temples and Mip-NeRF 360 Indoor, our model performs slightly worse, which we attribute to 3DGS’s ability to fake view-dependent effects with popping. When enabling Opacity Decay, which results in ${\sim}50\%$ fewer Gaussians, our method retains more quality than 3DGS. In general, our approach performs comparably to 3DGS in terms of standard image quality metrics.

Popping

View inconsistencies between subsequent frames, such as popping, cannot be detected with standard image quality metrics. To detect such artifacts, we follow recent best practice in 3D style transfer (Nguyen-Phuoc et al., 2022) and measure the consistency between novel views and warped novel views with optical flow (Lai et al., 2018). While ground-truth images or videos may seem attractive as a reference, they vary significantly in location, and thus in view-dependent effects, or only exist for a small subset of the datasets we use. For our method and 3DGS, we capture videos from three separate camera paths per scene, exhibiting both rotation and translation. We then directly warp each frame $\mathbf{F}_i$ to a subsequent frame $\mathbf{F}_{i+t}$ with offset $t$ using optical flow predictions from state-of-the-art RAFT (Teed and Deng, 2020).

Measuring the error between the warped frame $\hat{\mathbf{F}}_{i+t}$ and the rendered frame $\mathbf{F}_{i+t}$ with MSE does not prove effective for detecting popping artifacts (see Fig. 9). MSE tends to weigh small inaccuracies that originate from warping higher than popping artifacts. ꟻLIP (Andersson et al., 2020) proves significantly more reliable in our experiments, as it approximates the difference perceived by humans when flipping between images, a scenario in which popping artifacts are particularly disturbing. For each frame, we calculate a consistency error $E_{i+t} = \text{\reflectbox{F}LIP}(\hat{\mathbf{F}}_{i+t}, \mathbf{F}_{i+t})$. For each video, consisting of $N_F$ frames, we then compute the mean ꟻLIP error as

$$\text{\reflectbox{F}LIP}_t = \frac{1}{N_F - t} \sum_{i=0}^{N_F - t} E_{i+t}. \tag{8}$$
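The aggregation over a video can be sketched independently of the concrete flow and error models. In the example below, `warp` and `error_fn` are hypothetical callables supplied by the caller. They stand in for an optical-flow warp (such as one driven by RAFT predictions) and for a ꟻLIP implementation, neither of which is reproduced here. With frames indexed from zero, the sketch averages over the $N_F - t$ available frame pairs.

```python
def mean_consistency_error(frames, warp, error_fn, t):
    """Average error between each warped frame and the rendered frame t steps later.

    warp(frame, src_index, dst_index) -> frame warped to the later view.
    error_fn(warped, rendered) -> scalar error for one frame pair.
    """
    n = len(frames)
    if not 0 < t < n:
        raise ValueError("offset t must satisfy 0 < t < number of frames")
    errors = [
        error_fn(warp(frames[i], i, i + t), frames[i + t])
        for i in range(n - t)
    ]
    return sum(errors) / len(errors)
```

Keeping the warp and the error function pluggable makes it straightforward to compare a perceptual metric against MSE on the same frame pairs.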

Note that the error metric includes a base error floor due to disocclusions under translation and correct view-dependent shading. To mitigate these issues, we use an occlusion detection method from Ruder et al. (2016), do not consider the outermost 20 pixels, and subtract the per-pixel minimum $\text{\reflectbox{F}LIP}_t$ score; clearly, this does not perturb the inter-method differences.

We use $t=1$ and $t=7$ to measure short-range and long-range consistency, following Nguyen-Phuoc et al. (2022). Tab. 3 shows our obtained results. The large margins, particularly for $\text{\reflectbox{F}LIP}_7$, highlight that our method is more view-consistent than 3DGS. We argue that $\text{\reflectbox{F}LIP}_7$ is a more reliable metric, allowing errors due to popping to accumulate over multiple frames, as can be seen in Fig. 8. Please see the supplementary video for further evidence. With Opacity Decay, our approach achieves virtually identical results, indicating that our method can handle large Gaussians. For 3DGS, popping is significantly increased, indicating that 3DGS may increase the number of Gaussians to hide imperfections in the renderer, while our approach achieves comparable view-consistency scores.

Table 3. View-consistency metrics for videos. We measure $\text{\reflectbox{F}LIP}_t$ for timesteps $t \in \{1, 7\}$ (lower is better). Our method outperforms 3DGS with and without Opacity Decay.

| Dataset | DB | | M360 Indoor | | M360 Outdoor | | T&T | |
|---|---|---|---|---|---|---|---|---|
| Metric | $\text{\reflectbox{F}LIP}_1$ | $\text{\reflectbox{F}LIP}_7$ | $\text{\reflectbox{F}LIP}_1$ | $\text{\reflectbox{F}LIP}_7$ | $\text{\reflectbox{F}LIP}_1$ | $\text{\reflectbox{F}LIP}_7$ | $\text{\reflectbox{F}LIP}_1$ | $\text{\reflectbox{F}LIP}_7$ |
| Without Opacity Decay | | | | | | | | |
| 3DGS | 0.0061 | 0.0114 | 0.0069 | 0.0134 | 0.0083 | 0.0148 | 0.0102 | 0.0286 |
| Ours | 0.0053 | 0.0059 | 0.0060 | 0.0077 | 0.0085 | 0.0122 | 0.0076 | 0.0113 |
| With Opacity Decay | | | | | | | | |
| 3DGS | 0.0063 | 0.0122 | 0.0072 | 0.0149 | 0.0083 | 0.0154 | 0.0107 | 0.0315 |
| Ours | 0.0052 | 0.0055 | 0.0060 | 0.0073 | 0.0083 | 0.0115 | 0.0076 | 0.0114 |
Depth Evaluation.

3DGS enables efficient extraction of depth values $D \in \mathbb{R}_{+}$ with volumetric rendering:

$$D = \sum_{i=1}^{N_{\mathbf{r}}} \phi(\mu_i; \mathbf{o}, \mathbf{d})\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j), \tag{9}$$

where $\phi(\cdot)$ describes the depth of a single Gaussian with location $\mu$ (in 3DGS’s case, $\phi(\mu; \mathbf{o}, \mathbf{d}) = \|\mu - \mathbf{o}\|_2$). Clearly, this depth estimate is dependent on the sort order, leading to problems for 3DGS’s approximate global sort. Our approach improves sort quality and places 2D splats at the points of maximum contribution ($\phi(\mu; \mathbf{o}, \mathbf{d}) = t_{opt}$, cf. Eqn. (4)).

We establish a metric to compare multi-view consistency in depth estimates, leveraging the sparse point cloud $\mathcal{P} = \{\mathbf{p}_i\}$ from COLMAP (Schönberger and Frahm, 2016), which serves as initialization for 3DGS. If $\mathbf{p}_i$ is visible from a camera with position $\mathbf{o}$, we reconstruct the estimated location $\hat{\mathbf{p}}_i = \mathbf{o} + D \cdot \mathbf{d}$, with rendered depth $D$ and view direction $\mathbf{d}$ of the corresponding pixel of $\mathbf{p}_i$. The black background for real-world scenes used by 3DGS enables cheating by not fully accumulating opacity and blending the background color. For a fair comparison, if any of the tested methods has $T_{N_{\mathbf{r}}} > 1 \times 10^{-2}$ for a point $\mathbf{p}_i$, we do not consider this point in our set of tested points $\bar{\mathcal{P}}$. To minimize errors due to resolution, we render at the resolution used for COLMAP when computing $\hat{\mathbf{p}}_i$. Finally, we establish the depth error $E_{depth}$ as

$$E_{depth} = \frac{1}{|\bar{\mathcal{P}}|} \sum_{\mathbf{p}_i \in \bar{\mathcal{P}}} \|\hat{\mathbf{p}}_i - \mathbf{p}_i\|_2. \tag{10}$$

We compute $E_{depth}$ for all test set views and report our results in Tab. 4. On average, our method achieves better scores than 3DGS, especially for the outdoor scenes of Mip-NeRF 360 (Barron et al., 2022). Opacity Decay leads to significantly fewer and larger Gaussians, often resulting in lower accumulated opacity and, consequently, more discarded points. Both methods achieve better results for $E_{depth}$ in this case, as these removed points often correspond to the background, where depth estimates are generally less precise.
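The depth-error computation for a single view can be written down directly. The sketch below is a minimal illustration: it assumes unit-length view directions, takes per-point lists of final transmittances (one value per tested method) to build the shared set of tested points, and uses the hypothetical name `depth_error`. Rendering and the point-to-pixel lookup are outside its scope.

```python
import math

def depth_error(origins, directions, depths, points, final_transmittances,
                max_transmittance=1e-2):
    """Mean distance between re-projected and reference points (Eq. 10).

    final_transmittances[i] lists T_N for point i, one value per tested method;
    the point is skipped if any method left too much transmittance.
    Returns (error, number of tested points).
    """
    total, count = 0.0, 0
    for o, d, depth, p, trans in zip(origins, directions, depths, points,
                                     final_transmittances):
        if any(t > max_transmittance for t in trans):
            continue  # not fully opaque for some method, exclude for fairness
        # Estimated location: camera position plus rendered depth along the ray.
        estimate = [oc + depth * dc for oc, dc in zip(o, d)]
        total += math.dist(estimate, p)
        count += 1
    if count == 0:
        raise ValueError("no point passed the transmittance test")
    return total / count, count
```

Returning the number of tested points alongside the error makes it easy to report how many points were discarded in each configuration.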

Table 4. Depth-consistency metric $E_{depth}$ for 3D points $\bar{\mathcal{P}}$ from COLMAP (Schönberger and Frahm, 2016) (lower is better). We report the mean results over all test set views. Our method outperforms 3DGS with and without Opacity Decay. In total, we consider $|\bar{\mathcal{P}}| = 17404$ points without Opacity Decay and $|\bar{\mathcal{P}}| = 11306$ with Opacity Decay.

| Method | DB | M360 Indoor | M360 Outdoor | T&T | Average |
|---|---|---|---|---|---|
| Without Opacity Decay | | | | | |
| 3DGS | 0.133 | 0.219 | 0.764 | 1.108 | 0.552 |
| Ours | 0.122 | 0.242 | 0.387 | 0.947 | 0.388 |
| With Opacity Decay | | | | | |
| 3DGS | 0.095 | 0.127 | 0.637 | 0.967 | 0.447 |
| Ours | 0.073 | 0.168 | 0.408 | 0.916 | 0.361 |
[Figure 8 panels. Column labels: View $\mathbf{F}_i$, ꟻLIP$_1$, Warped View $\hat{\mathbf{F}}_{i+1}$, ꟻLIP$_7$, Warped View $\hat{\mathbf{F}}_{i+7}$. Rows alternate between 3DGS and Ours for three scenes.]

Figure 8. Visualization of our proposed popping detection method with detailed views inset. We warp view $\mathbf{F}_i$ to $\hat{\mathbf{F}}_{i+1}, \hat{\mathbf{F}}_{i+7}$ using optical flow and use ꟻLIP to measure errors between warped and non-warped views. While ꟻLIP$_1$ is able to effectively detect popping artifacts, the obtained errors are only accumulated over a single frame. On the contrary, ꟻLIP$_7$ is able to accumulate errors due to popping over multiple frames, making this metric more reliable. We increased contrast for the zoomed-in views to better highlight view-inconsistencies.

[Figure 9 panels: Non-warped view $\mathbf{F}_{i+1}$, Warped view $\hat{\mathbf{F}}_{i+1}$, ꟻLIP$_1$, $\text{MSE}(\mathbf{F}_{i+1}, \hat{\mathbf{F}}_{i+1})$.]
Figure 9. Comparison between ꟻLIP and MSE to measure differences between rendered frames $\mathbf{F}_{i+1}$ and warped frames $\hat{\mathbf{F}}_{i+1}$ for 3DGS. Notably, using MSE does not yield large errors even when disturbing popping artifacts are encountered, whereas ꟻLIP weighs such artifacts accordingly.

4.2. Qualitative Evaluation

To complement our quantitative evaluation, we provide image comparisons in Fig. 10 and conduct a user study to verify the effectiveness of our approach and our proposed popping detection method.

[Figure 10 panels. Rows: 3DGS, Ground-Truth, Ours. Six image columns; the column labels alternate between "Ours PSNR↑" and "3DGS PSNR↑".]
Figure 10. Image comparisons of our method and 3DGS. In most configurations, our rendered images are virtually indistinguishable from 3DGS. For each scene, we show a result where our method performs better on the left, and a result where 3DGS performs better on the right.

4.2.1. User Study

18 participants were presented with pairs of videos from our approach and 3DGS, following the same camera path. The captured scenes exhibit rotation, translation, as well as a combination of the two. We instructed the participants to rate the videos concerning view-consistency and popping artifacts. The participants then indicated whether either of the techniques performed better or equal, which we translated into scores $s \in \{-1, 0, +1\}$. On average, the results showed a clear preference for our approach ($s_{mean} = 0.42$), which is statistically significant according to Wilcoxon Signed Rank tests ($Z = 2276.5$, $p < .0001$). Details about the study can be found in Appendix D.

4.3. Performance and Ablation

In the following, we provide a detailed performance analysis for different configurations of our method. For our timings, we take all available COLMAP poses and interpolate a camera path between them (30 frames per pose), ensuring a variety of plausible viewpoints. All timings were measured for Full HD rendering and averaged over 4 runs, where we used an NVIDIA RTX 4090 with CUDA 11.8.

Performance for different configurations.

We provide a performance comparison between 3DGS and our renderer with different configurations in Tab. 5. On average, the Render stage takes considerably longer for our hierarchical renderer (A-E) due to additional per-ray sorting. Not computing the per-tile depth (B) speeds up only the Duplicate stage (0.437 ms to 0.283 ms), a marginal gain in total time. Without our load balancing scheme (C), Duplicate takes roughly 5× longer, as it is mostly dominated by very large Gaussians. Disabling tile-based culling (D) slightly accelerates Preprocess but leads to many more entries in the global sorting data structure, which increases Sort and Render times. Disabling hierarchical culling inside the render kernel (E) leads to a drastic increase in Render time as all Gaussians move through the entire pipeline. Our final approach (A) with all optimizations achieves competitive runtimes on all evaluated scenes. Both methods see a drastic performance increase with Opacity Decay due to the significantly lower number of Gaussians. However, while our approach stays view-consistent, 3DGS shows even more popping artifacts.

Table 5. Performance timings for different configurations of our method and 3DGS. The number of Gaussians is roughly the same for all methods (scene average $\sim$2.98M). Applying Opacity Decay during training leads to $\sim 50\%$ fewer Gaussians (scene average $\sim$1.54M).
| Timings in ms | Preprocess | Duplicate | Sort | Render | Total |
|---|---|---|---|---|---|
| Without Opacity Decay | | | | | |
| 3DGS | 0.451 | 0.567 | 1.645 | 2.134 | 4.797 |
| (A) Ours | 0.649 | 0.437 | 0.301 | 3.599 | 4.986 |
| (B) Ours w/o per-tile depth | 0.658 | 0.283 | 0.301 | 3.599 | 4.841 |
| (C) Ours w/o load balancing | 0.847 | 2.059 | 0.415 | 3.505 | 6.827 |
| (D) Ours w/o tile-based culling | 0.610 | 0.479 | 1.180 | 5.346 | 7.614 |
| (E) Ours w/o hier. culling | 0.649 | 0.437 | 0.301 | 5.967 | 7.364 |
| With Opacity Decay | | | | | |
| 3DGS | 0.215 | 0.378 | 0.626 | 1.059 | 2.276 |
| Ours | 0.366 | 0.223 | 0.161 | 2.227 | 2.976 |
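
To make the relative costs in Tab. 5 easier to read, the following minimal Python sketch recomputes the total frame time and the per-stage ratios of configuration (A) against 3DGS. The stage timings are copied from the table (without Opacity Decay); the names `timings`, `total_ms`, and `stage_ratio` are illustrative assumptions and not part of the renderer.

```python
# Stage timings in ms, copied from Table 5 (without Opacity Decay).
STAGES = ("Preprocess", "Duplicate", "Sort", "Render")
timings = {
    "3DGS": (0.451, 0.567, 1.645, 2.134),
    "Ours (A)": (0.649, 0.437, 0.301, 3.599),
}


def total_ms(method):
    """Sum of all four pipeline stages for one method."""
    return sum(timings[method])


def stage_ratio(stage, method="Ours (A)", baseline="3DGS"):
    """Time of `method` divided by time of `baseline` for a single stage."""
    i = STAGES.index(stage)
    return timings[method][i] / timings[baseline][i]


slowdown = total_ms("Ours (A)") / total_ms("3DGS")
print(f"total: {slowdown:.3f}x")
for stage in STAGES:
    print(f"{stage}: {stage_ratio(stage):.2f}x")
```

The total ratio comes out at about 1.04×: the more expensive Render stage is largely offset by the much cheaper Sort stage.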
Scene Comparison.

Individual scenes with a similar number of Gaussians can exhibit sharp differences in runtime behavior. In Tab. 6 and Tab. 7, we show detailed timings and metrics for two exemplary scenes, Bonsai and Train, which display the largest inter-method differences in performance, despite their comparable number of Gaussians $N$. Even though the Train scene contains slightly fewer Gaussians than Bonsai, the average number of visible (inside the view-frustum) Gaussians $N_V$, as well as their average screen-space size (indicated by avg./std. corresponding image tiles $N_t$), is considerably larger.

As larger Gaussian splats provide more opportunities for culling, our tile-based culling results in a larger reduction of avg. $N_t$ for Train than Bonsai ($\sim 3.5\times$ vs. $\sim 2.5\times$). The resulting lower number of sort entries allows Train to amortize the slower Render stage with a much faster Sort, while Bonsai does not experience the same gains.

Table 6. Performance timings for different configurations of our method and 3DGS for the exemplary scenes Bonsai & Train, which show contrary runtime behaviors. Times in ms for Full HD resolution.
| Timings in ms | Preprocess | Duplicate | Sort | Render | Total |
|---|---|---|---|---|---|
| Bonsai, $\sim$1.25M Gaussians | | | | | |
| 3DGS | 0.224 | 0.384 | 0.700 | 1.266 | 2.574 |
| (A) Ours | 0.295 | 0.321 | 0.173 | 2.610 | 3.399 |
| (C) Ours w/o load balancing | 0.467 | 1.920 | 0.272 | 2.592 | 5.251 |
| (D) Ours w/o tile-based culling | 0.282 | 0.331 | 0.554 | 3.680 | 4.846 |
| Train, $\sim$1.05M Gaussians | | | | | |
| 3DGS | 0.288 | 0.811 | 2.451 | 1.998 | 5.548 |
| (A) Ours | 0.409 | 0.495 | 0.270 | 3.052 | 4.225 |
| (C) Ours w/o load balancing | 0.647 | 2.336 | 0.333 | 2.899 | 6.215 |
| (D) Ours w/o tile-based culling | 0.323 | 0.542 | 1.550 | 5.054 | 7.469 |
Table 7. Metrics of our method and 3DGS for exemplary scenes Bonsai & Train, highlighting the effect of our tile-based culling. Columns include total vs. visible (in view-frustum) number of Gaussians ($N$ vs. $N_V$), as well as standard deviation and average number of $16 \times 16$ tiles covered by each visible Gaussian ($N_t$). We additionally include an approximate number of sort entries as $N_V \cdot \text{avg}(N_t)$.
| Scene | Method | $N$ | $N_V$ | $\text{avg}(N_t)$ | $\text{std}(N_t)$ | Sort Entries |
|---|---|---|---|---|---|---|
| Bonsai | Ours | 1.26M | 0.41M | 4.198 | 15.282 | 1.72M |
| Bonsai | 3DGS | 1.24M | 0.40M | 10.801 | 52.236 | 4.36M |
| Train | Ours | 1.05M | 0.57M | 5.004 | 20.127 | 2.85M |
| Train | 3DGS | 1.08M | 0.59M | 17.282 | 89.891 | 10.2M |
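
The approximate sort-entry count $N_V \cdot \text{avg}(N_t)$ and the per-scene reduction in $\text{avg}(N_t)$ can be recomputed from Tab. 7 with a short Python sketch. The values are copied from the table; the helper names `sort_entries_m` and `tile_reduction` are illustrative assumptions. Because $N_V$ is only given to two decimals, the recomputed products can differ slightly from the tabulated sort entries.

```python
# (N_V in millions, avg(N_t)) copied from Table 7.
table7 = {
    ("Bonsai", "Ours"): (0.41, 4.198),
    ("Bonsai", "3DGS"): (0.40, 10.801),
    ("Train", "Ours"): (0.57, 5.004),
    ("Train", "3DGS"): (0.59, 17.282),
}


def sort_entries_m(scene, method):
    """Approximate number of sort entries in millions: N_V * avg(N_t)."""
    n_v, avg_nt = table7[(scene, method)]
    return n_v * avg_nt


def tile_reduction(scene):
    """Factor by which avg(N_t) shrinks from 3DGS to our tile-based culling."""
    return table7[(scene, "3DGS")][1] / table7[(scene, "Ours")][1]


for scene in ("Bonsai", "Train"):
    print(scene,
          f"entries: {sort_entries_m(scene, '3DGS'):.2f}M -> {sort_entries_m(scene, 'Ours'):.2f}M,",
          f"avg(N_t) reduced {tile_reduction(scene):.2f}x")
```
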
Backward Pass Performance.

Our backward Render pass is only $1.1\times$ slower than that of 3DGS, compared to the $1.5\times$ slowdown we see for the forward Render stage. This is mostly due to the backward Render executing a large number of atomics, which are equal between both approaches. Although the backward pass skips Duplicate and Sort (which are faster in our renderer), the final change in training time is only about 3%. The backward Render pass is only a single step in the entire training pipeline and thus, the overall time loss is close to negligible. Again, if we turn on Opacity Decay, training becomes proportionally faster.

5. Conclusion, Limitations, and Future Work

In this paper, we took a closer look at the way 3D Gaussian Splatting orders splats during blending. A detailed analysis of the splat's depth computation revealed the reason for popping artifacts of 3DGS: the computed depth is highly inconsistent under rotation. A per-ray depth computation, which considers the point of highest contribution along the ray as the optimal blending depth, removes all popping artifacts but is $100\times$ more costly. With our hierarchical renderer, which includes multiple culling and resorting stages, we are only $1.04\times$ slower than 3DGS on average. While it is difficult to identify popping in standard quality metrics, we provided a view-consistency metric based on optical flow and ꟻLIP, which shows that our approach significantly reduces popping. We could also confirm this fact in a user study and provided an additional metric confirming increased view-consistency and more accurate depth estimates for our method. Furthermore, our approach remains view-consistent even when constructing the scene with half the Gaussians, for which 3DGS shows a significant increase in popping artifacts. As such, our approach can reduce memory by $2\times$ and render times by $1.6\times$ compared to 3DGS in this configuration, while reducing popping artifacts and achieving virtually indistinguishable quality.

While our approach typically removes all artifacts in our tests, resorting does not guarantee the right blend order, and thus could still lead to popping or flickering for very complex geometric relationships. Furthermore, our approach still ignores overlaps between Gaussians along the view ray. A fully correct volume rendering of Gaussians may not only remove artifacts completely but could also lead to better scene reconstructions, a direction certainly worth exploring in the future. Both our renderer and our optimizations for 3DGS are publicly available at https://github.com/r4dl/StopThePop.

References

  • Andersson et al. (2020) Pontus Andersson, Jim Nilsson, Tomas Akenine-Möller, Magnus Oskarsson, Kalle Åström, and Mark D. Fairchild. 2020. FLIP: A Difference Evaluator for Alternating Images. Proceedings of the ACM on Computer Graphics and Interactive Techniques 3, 2, Article 15 (2020), 23 pages.
  • Barron et al. (2021) Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. 2021. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Barron et al. (2022) Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. 2022. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Barron et al. (2023) Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. 2023. Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Batcher (1968) Kenneth E. Batcher. 1968. Sorting Networks and Their Applications. In Proceedings of the Spring Joint Computer Conference.
  • Bavoil et al. (2007) Louis Bavoil, Steven P. Callahan, Aaron Lefohn, João L. D. Comba, and Cláudio T. Silva. 2007. Multi-fragment effects on the GPU using the k-buffer. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games.
  • Broxton et al. (2020) Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew DuVall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. 2020. Immersive Light Field Video with a Layered Mesh Representation. ACM Transactions on Graphics 39, 4, Article 86 (2020), 15 pages.
  • Butler et al. (2012) Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. 2012. A Naturalistic Open Source Movie for Optical Flow Evaluation. In Proceedings of the European Conference on Computer Vision.
  • Callahan et al. (2005) Steven P. Callahan, Milan Ikits, João L. D. Comba, and Cláudio T. Silva. 2005. Hardware-Assisted Visibility Sorting for Unstructured Volume Rendering. IEEE Transactions on Visualization and Computer Graphics 11, 3 (2005), 285–295.
  • Carpenter (1984) Loren Carpenter. 1984. The A-buffer, an Antialiased Hidden Surface Method. In ACM SIGGRAPH Conference Proceedings.
  • Chen et al. (2022) Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. 2022. TensoRF: Tensorial Radiance Fields. In Proceedings of the European Conference on Computer Vision.
  • Chen et al. (2023) Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, ** Liu, and Guosheng Lin. 2023. GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting. arXiv preprint arXiv:2311.14521 (2023).
  • Eldridge et al. (2000) Matthew Eldridge, Homan Igehy, and Pat Hanrahan. 2000. Pomegranate: A Fully Scalable Graphics Architecture. In ACM SIGGRAPH Conference Proceedings.
  • Fang et al. (2023) Jiemin Fang, Junjie Wang, Xiaopeng Zhang, Lingxi Xie, and Qi Tian. 2023. GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions. arXiv preprint arXiv:2311.16037 (2023).
  • Fridovich-Keil et al. (2022) Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. 2022. Plenoxels: Radiance Fields without Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Hedman et al. (2018) Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. 2018. Deep Blending for Free-viewpoint Image-based Rendering. ACM Transactions on Graphics 37, 6, Article 257 (2018), 15 pages.
  • Jain et al. (2022) Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. 2022. Zero-Shot Text-Guided Object Generation with Dream Fields. (2022).
  • Jambon et al. (2023) Clément Jambon, Bernhard Kerbl, Georgios Kopanas, Stavros Diolatzis, George Drettakis, and Thomas Leimkühler. 2023. NeRFshop: Interactive Editing of Neural Radiance Fields. Proceedings of the ACM on Computer Graphics and Interactive Techniques 6, 1, Article 1 (2023), 21 pages.
  • Karis et al. (2021) Brian Karis, Rune Stubbe, and Graham Wihlidal. 2021. A Deep Dive into Nanite Virtualized Geometry. In ACM SIGGRAPH Conference Proceedings.
  • Kenzel et al. (2018) Michael Kenzel, Bernhard Kerbl, Dieter Schmalstieg, and Markus Steinberger. 2018. A High-Performance Software Graphics Pipeline Architecture for the GPU. ACM Transactions on Graphics 37, 4, Article 140 (2018), 15 pages.
  • Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (2023).
  • Knapitsch et al. (2017) Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Transactions on Graphics 36, 4, Article 78 (2017), 13 pages.
  • Kuang et al. (2023) Zhengfei Kuang, Fujun Luan, Sai Bi, Zhixin Shu, Gordon Wetzstein, and Kalyan Sunkavalli. 2023. PaletteNeRF: Palette-based Appearance Editing of Neural Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Lai et al. (2018) Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. 2018. Learning Blind Video Temporal Consistency. In Proceedings of the European Conference on Computer Vision.
  • Laine and Karras (2011) Samuli Laine and Tero Karras. 2011. High-Performance Software Rasterization on GPUs. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics.
  • Liu et al. (2010) Fang Liu, Meng-Cheng Huang, Xue-Hui Liu, and En-Hua Wu. 2010. FreePipe: a Programmable Parallel Rendering Architecture for Efficient Multi-Fragment Effects. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games.
  • Luiten et al. (2024) Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. 2024. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. In International Conference on 3D Vision.
  • Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In Proceedings of the European Conference on Computer Vision.
  • Müller et al. (2022) Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Transactions on Graphics 41, 4, Article 102 (2022), 15 pages.
  • Nguyen-Phuoc et al. (2022) Thu Nguyen-Phuoc, Feng Liu, and Lei Xiao. 2022. SNeRF: Stylized Neural Implicit Representations for 3D Scenes. ACM Transactions on Graphics 41, 4, Article 142 (2022), 11 pages.
  • Park et al. (2021) Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. 2021. Nerfies: Deformable Neural Radiance Fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Patney et al. (2015) Anjul Patney, Stanley Tzeng, Kerry A Seitz Jr, and John D Owens. 2015. Piko: A Framework for Authoring Programmable Graphics Pipelines. ACM Transactions on Graphics 34, 4, Article 147 (2015), 13 pages.
  • Poole et al. (2022) Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. In Proceedings of the International Conference on Learning Representations.
  • Pumarola et al. (2020) Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. 2020. D-NeRF: Neural Radiance Fields for Dynamic Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Raj et al. (2023) Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Ben Mildenhall, Nataniel Ruiz, Shiran Zada, Kfir Aberman, Michael Rubenstein, Jonathan Barron, Yuanzhen Li, and Varun Jampani. 2023. DreamBooth3D: Subject-Driven Text-to-3D Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Ruder et al. (2016) Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. 2016. Artistic Style Transfer for Videos. In Proceedings of the German Conference on Pattern Recognition.
  • Salvi et al. (2011) Marco Salvi, Jefferson Montgomery, and Aaron Lefohn. 2011. Adaptive Transparency. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics.
  • Salvi and Vaidyanathan (2014) Marco Salvi and Karthik Vaidyanathan. 2014. Multi-Layer Alpha Blending. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games.
  • Schönberger and Frahm (2016) Johannes L. Schönberger and Jan-Michael Frahm. 2016. Structure-from-Motion Revisited. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Schütz et al. (2021) Markus Schütz, Bernhard Kerbl, and Michael Wimmer. 2021. Rendering Point Clouds with Compute Shaders and Vertex Order Optimization. Computer Graphics Forum 40, 4 (2021), 115–126.
  • Seiler et al. (2008) Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, et al. 2008. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics 27, 3 (2008), 1–15.
  • Snavely et al. (2006) Noah Snavely, Steven M. Seitz, and Richard Szeliski. 2006. Photo Tourism: Exploring Photo Collections in 3D. ACM Transactions on Graphics 25, 3 (2006), 835–846.
  • Tang et al. (2022) Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, and Gang Zeng. 2022. Compressible-composable NeRF via Rank-residual Decomposition. Advances in Neural Information Processing Systems.
  • Tang et al. (2023) Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2023. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv preprint arXiv:2309.16653 (2023).
  • Teed and Deng (2020) Zachary Teed and Jia Deng. 2020. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In Proceedings of the European Conference on Computer Vision.
  • Tzeng et al. (2010) Stanley Tzeng, Anjul Patney, and John D Owens. 2010. Task Management for Irregular-Parallel Workloads on the GPU. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics.
  • Woolson (2008) R. F. Woolson. 2008. Wilcoxon Signed-Rank Test. John Wiley & Sons, Ltd, 1–3.
  • Wu et al. (2023) Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 2023. 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering. arXiv preprint arXiv:2310.08528 (2023).
  • Wyman (2016) Chris Wyman. 2016. Exploring and Expanding the Continuum of OIT Algorithms. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics.
  • Yi et al. (2023) Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. 2023. GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models. arXiv preprint arXiv:2310.08529 (2023).
  • Yu et al. (2023) Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. 2023. Mip-Splatting: Alias-free 3D Gaussian Splatting. arXiv preprint arXiv:2311.16493 (2023).
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Zhou et al. (2009) Kun Zhou, Qiming Hou, Zhong Ren, Minmin Gong, Xin Sun, and Baining Guo. 2009. RenderAnts: Interactive Reyes Rendering on GPUs. ACM Transactions on Graphics 28, 5 (2009), 1–11.
  • Zielonka et al. (2023) Wojciech Zielonka, Timo Bolkart, and Justus Thies. 2023. Instant Volumetric Head Avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Zwicker et al. (2002) Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. 2002. EWA Splatting. IEEE Transactions on Visualization and Computer Graphics 8, 3 (2002), 223–238.

Appendix A Deriving Depth for 3D Gaussians along a Ray

In order to get an accurate depth estimate for our sort order of 3D Gaussians along a view ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, we compute the $t_{opt}$ which maximizes the Gaussian's contribution along the ray, i.e. $\operatorname*{arg\,max}_{t} G(\mathbf{r}(t))$. This optimum can be found through the following derivation:

\begin{align}
\frac{dG(\mathbf{r}(t))}{dt} &= -\frac{1}{2} G(\mathbf{r}(t)) \cdot \left( (\mathbf{r}(t) - \mathbf{\mu})^T \Sigma^{-1} \mathbf{d} + \mathbf{d}^T \Sigma^{-1} (\mathbf{r}(t) - \mathbf{\mu}) \right) \nonumber \\
&= -\frac{1}{2} G(\mathbf{r}(t)) \cdot \left( 2 \cdot \mathbf{d}^T \Sigma^{-1} (\mathbf{r}(t) - \mathbf{\mu}) \right) \nonumber \\
&= -G(\mathbf{o} + t\mathbf{d}) \cdot \left( \mathbf{d}^T \Sigma^{-1} (\mathbf{o} + t\mathbf{d} - \mathbf{\mu}) \right) \overset{!}{=} 0 \nonumber \\
&\Rightarrow\ \mathbf{d}^T \Sigma^{-1} (\mathbf{o} + t\mathbf{d} - \mathbf{\mu}) = 0 \nonumber \\
&\Rightarrow\ \mathbf{d}^T \Sigma^{-1} \mathbf{d} \cdot t + \mathbf{d}^T \Sigma^{-1} (\mathbf{o} - \mathbf{\mu}) = 0 \nonumber \\
t_{opt} &= \frac{\mathbf{d}^T \Sigma^{-1} (\mathbf{\mu} - \mathbf{o})}{\mathbf{d}^T \Sigma^{-1} \mathbf{d}}. \tag{11}
\end{align}

The simplification from the first to the second line relies on the fact that $\Sigma^{-1}$ is symmetric and thus both expressions are identical. $\Sigma^{-1}$ can be efficiently computed:

$$\Sigma^{-1} = \left(RSSR^T\right)^{-1} = RS^{-1}S^{-1}R^T = R \begin{pmatrix} s_x^{-2} & 0 & 0 \\ 0 & s_y^{-2} & 0 \\ 0 & 0 & s_z^{-2} \end{pmatrix} R^T.$$
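
As an illustration of Eqn. (11), the following Python sketch evaluates $t_{opt}$ for a single Gaussian, building $\Sigma^{-1}$ from a rotation matrix and per-axis scales as shown above. It is a minimal sketch rather than the renderer's implementation: the function names (`inverse_covariance`, `t_opt`, `gaussian`) and the example values are assumptions chosen for readability, and plain Python lists stand in for GPU vector types.

```python
import math


def inverse_covariance(R, s):
    """Sigma^{-1} = R diag(s_x^-2, s_y^-2, s_z^-2) R^T for a 3x3 rotation R and scales s."""
    inv_s2 = [1.0 / (sk * sk) for sk in s]
    return [[sum(R[i][k] * inv_s2[k] * R[j][k] for k in range(3))
             for j in range(3)] for i in range(3)]


def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def t_opt(o, d, mu, sigma_inv):
    """Eqn. (11): ray parameter t that maximizes G(o + t d)."""
    # Sigma^{-1} is symmetric, so d^T Sigma^{-1} v equals (Sigma^{-1} d) . v.
    sd = mat_vec(sigma_inv, d)
    return dot(sd, [m - oi for m, oi in zip(mu, o)]) / dot(sd, d)


def gaussian(x, mu, sigma_inv):
    """Unnormalized Gaussian exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return math.exp(-0.5 * dot(diff, mat_vec(sigma_inv, diff)))


# Hypothetical example: Gaussian rotated by 30 degrees about the z axis.
angle = math.radians(30.0)
c, sn = math.cos(angle), math.sin(angle)
R = [[c, -sn, 0.0], [sn, c, 0.0], [0.0, 0.0, 1.0]]
sigma_inv = inverse_covariance(R, (2.0, 0.5, 1.0))
o, d, mu = (0.0, 0.0, 0.0), (1.0, 0.2, 1.0), (1.0, 0.5, 3.0)

t = t_opt(o, d, mu, sigma_inv)
point = [oi + t * di for oi, di in zip(o, d)]
print(t, gaussian(point, mu, sigma_inv))
```

Note that $\mathbf{d}$ does not need to be normalized: scaling $\mathbf{d}$ rescales $t_{opt}$ inversely, so the point $\mathbf{o} + t_{opt}\mathbf{d}$ stays the same.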

Appendix B Additional Implementation Details

This section contains a more thorough description of our implementation and various optimization strategies to make our hierarchical rasterizer viable for real-time rendering.

B.1. Tile-based Culling

In Algorithm 1, we describe how to find the maximally contributing point $\hat{\mathbf{x}}$ of a 2D Gaussian $G_2$ parameterized by $\boldsymbol{\mu}_2, \Sigma_2^{-1}$ inside an axis-aligned tile $X$. If $\boldsymbol{\mu}_2$ lies inside $X$, then it is consequently also the maximum. Otherwise, the maximum has to lie on one of the two edges that are reachable from $\boldsymbol{\mu}_2$. Those are the two edges that originate from the tile corner point $\hat{\mathbf{p}}$ closest to $\boldsymbol{\mu}_2$. We can then find the optimum by performing the same computation as in Eqn. (11), but in 2D. By checking if ${\boldsymbol{\mu}_2}_x, {\boldsymbol{\mu}_2}_y$ are in range of the tile in $x, y$ direction, as well as clamping the values of $t_x, t_y$ to $[0, 1]$, we ensure that the final point will lie on one of these two edges.
The fact that the $y$ coordinate of $\mathbf{d}_x$ and the $x$ coordinate of $\mathbf{d}_y$ are zero allows for further simplifications in the final implementation.
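The procedure described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the CUDA implementation of the renderer: the function names, the tuple layout `(a, b, c)` for the symmetric inverse covariance, and the tie-breaking rule for the closest corner are all hypothetical, and the tile is assumed to have positive extent.

```python
def clamp01(t):
    return min(1.0, max(0.0, t))


def mahalanobis_sq(p, mu, inv_cov):
    """(p - mu)^T Sigma^{-1} (p - mu) with Sigma^{-1} = [[a, b], [b, c]]."""
    a, b, c = inv_cov
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    return a * dx * dx + 2.0 * b * dx * dy + c * dy * dy


def max_contribution_point(mu, inv_cov, x_min, x_max, y_min, y_max):
    """Point of the tile where the 2D Gaussian is largest (hypothetical helper)."""
    a, b, c = inv_cov
    mx, my = mu
    out_x = mx < x_min or mx > x_max
    out_y = my < y_min or my > y_max
    if not (out_x or out_y):
        return (mx, my)  # the mean lies inside the tile

    # Tile corner closest to the mean (ties resolved towards the min corner).
    px = x_min if abs(mx - x_min) <= abs(mx - x_max) else x_max
    py = y_min if abs(my - y_min) <= abs(my - y_max) else y_max
    # Edge vectors d_x = (dx, 0) and d_y = (0, dy) to the neighbouring corners.
    dx = x_max - x_min if px == x_min else x_min - x_max
    dy = y_max - y_min if py == y_min else y_min - y_max

    diff_x, diff_y = mx - px, my - py
    tx = ty = 0.0
    if out_x:  # optimum along the vertical edge through the corner
        ty = clamp01((b * diff_x + c * diff_y) / (c * dy))
    if out_y:  # optimum along the horizontal edge through the corner
        tx = clamp01((a * diff_x + b * diff_y) / (a * dx))
    return (px + tx * dx, py + ty * dy)
```

Because one component of each edge vector is zero, the quotients reduce to scalar expressions, which is the simplification mentioned in the text.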

ALGORITHM 1. Finding maximum of 2D Gaussian inside AABB
Data: $X = \{\forall \mathbf{x} \in \mathbb{R}^2 \mid x_{\min} \leq \mathbf{x}_x \leq x_{\max} \land y_{\min} \leq \mathbf{x}_y \leq y_{\max}\}$
Result: $\hat{\mathbf{x}} = \operatorname*{arg\,min}_{\mathbf{x} \in X} (\mathbf{x} - \boldsymbol{\mu}_2)^T \Sigma_2^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)$
if $\boldsymbol{\mu}_2 \in X$ then
    $\hat{\mathbf{x}} \leftarrow \boldsymbol{\mu}_2$;
else
    $\hat{\mathbf{p}} \leftarrow$ corner closest to $\boldsymbol{\mu}_2$;
    $\mathbf{d}_x, \mathbf{d}_y \leftarrow$ vectors to next AABB corners in $x, y$ direction;
    $t_x, t_y \leftarrow 0$;
    if ${\boldsymbol{\mu}_2}_x < x_{\min} \lor {\boldsymbol{\mu}_2}_x > x_{\max}$ then
        $t_y \leftarrow \min\left(1, \max\left(0, \frac{\mathbf{d}_y^T \Sigma_2^{-1} (\boldsymbol{\mu}_2 - \hat{\mathbf{p}})}{\mathbf{d}_y^T \Sigma_2^{-1} \mathbf{d}_y}\right)\right)$;
    end if
    if ${\boldsymbol{\mu}_2}_y < y_{\min} \lor {\boldsymbol{\mu}_2}_y > y_{\max}$ then
        $t_x \leftarrow \min\left(1, \max\left(0, \frac{\mathbf{d}_x^T \Sigma_2^{-1} (\boldsymbol{\mu}_2 - \hat{\mathbf{p}})}{\mathbf{d}_x^T \Sigma_2^{-1} \mathbf{d}_x}\right)\right)$;
    end if
    $\hat{\mathbf{x}} \leftarrow \hat{\mathbf{p}} + t_x \mathbf{d}_x + t_y \mathbf{d}_y$;
end if
$\boldsymbol{\mu}_2, \Sigma_2^{-1}$: mean and inverse covariance matrix of 2D Gaussian $G_2$
$x_{\min}, x_{\max}, y_{\min}, y_{\max}$: AABB dimensions

B.2. Tighter Bounding of 2D Gaussians

For computing the bounding rectangle of touched tiles on screen, Kerbl et al. (2023) first bound each 2D Gaussian with a circle of radius $r = 3 \cdot \lambda_{\max}$, where $\lambda_{\max}$ denotes the largest eigenvalue of the 2D covariance matrix $\Sigma_2$. They use a constant factor $t_O = 3$ as a bound for a Gaussian, effectively clipping it at 0.3% of its peak value. We instead calculate an exact bound by considering the Gaussian's actual opacity value $\alpha$ and compute $t_O = \sqrt{2 \log\left(\frac{\alpha}{\epsilon_O}\right)}$, which is itself upper bounded by ${t_O}_{\max} \approx 3.3290$ (since $\alpha \in [0, 1]$). Therefore, we conclude that the bound of $t_O = 3$ by Kerbl et al. (2023) was actually chosen too small for the opacity threshold $\epsilon_O = \frac{1}{255}$ used in the renderer. Additionally, our calculated bound allows us to fit a tighter circular bound around Gaussians with $\alpha < 1$.
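The opacity-aware bound can be checked numerically. The following is a small sketch, assuming the natural logarithm and the threshold $\epsilon_O = \frac{1}{255}$ stated above; the function name `cutoff_factor` and the early-out for Gaussians that never reach the threshold are illustrative assumptions, not part of the renderer's API.

```python
import math

EPS_O = 1.0 / 255.0  # opacity threshold used in the renderer


def cutoff_factor(alpha, eps=EPS_O):
    """t_O = sqrt(2 * log(alpha / eps)); the Gaussian falls below eps beyond t_O.

    Returns 0.0 when alpha never exceeds eps (assumption: such a Gaussian
    contributes nothing and needs no extent).
    """
    if alpha <= eps:
        return 0.0
    return math.sqrt(2.0 * math.log(alpha / eps))


t_max = cutoff_factor(1.0)   # upper bound, reached for fully opaque Gaussians
t_low = cutoff_factor(0.3)   # a semi-transparent Gaussian gets a tighter bound
```

For $\alpha = 1$ this evaluates to roughly 3.3290, which exceeds the constant factor of 3, while smaller opacities yield smaller factors.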

B.3. Global Sort

Using a giant global sort for all combined (tile/depth) keys seems wasteful. Sorting would be more efficient if the entries of each tile were sorted individually, using a global partitioned sort. However, this requires all the entries of a tile to be contiguous in memory, with each tile knowing the range of its respective entries. We can create such a setup by counting the number of entries per tile during the Preprocess stage with an atomic counter per tile and computing tile ranges with a prefix sum. In the Duplication stage, another atomic counter per tile can be used to retrieve offsets for each entry inside this range. While this reduces sorting costs to less than half in our experiments, the allocation using atomic operations adds an overhead that is about equal to the time saved in sorting. Thus, we opted to keep the original sorting approach.
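The partitioned layout considered here (and ultimately not adopted) can be illustrated with a sequential sketch. It assumes entries of the hypothetical form `(tile_id, depth, gaussian_id)`; the per-tile atomic counters of the GPU version are replaced by plain integer increments, so this only shows the data layout, not the performance behavior.

```python
from itertools import accumulate


def partition_and_sort(entries, num_tiles):
    """Place each tile's entries in a contiguous range, then sort per tile."""
    # Preprocess: count entries per tile (an atomic add per entry on the GPU).
    counts = [0] * num_tiles
    for tile, _, _ in entries:
        counts[tile] += 1

    # Exclusive prefix sum yields the start of each tile's range.
    starts = [0] + list(accumulate(counts))[:-1]

    # Duplication: a second per-tile counter hands out offsets inside the range.
    cursor = [0] * num_tiles
    out = [None] * len(entries)
    for entry in entries:
        tile = entry[0]
        out[starts[tile] + cursor[tile]] = entry
        cursor[tile] += 1

    # Partitioned sort: every tile range is sorted by depth independently.
    for tile in range(num_tiles):
        lo, hi = starts[tile], starts[tile] + counts[tile]
        out[lo:hi] = sorted(out[lo:hi], key=lambda e: e[1])
    return out, starts, counts
```

The result matches a single global sort by (tile, depth), but each sort only touches the entries of one tile.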

B.4. Per-stage details

Preprocess and Duplication

Similarly to 3DGS, we also prepare common values for each Gaussian during Preprocess: We compute and store $G_2$ for every Gaussian, evaluate Spherical Harmonics relying on the direction from the camera to the Gaussian's center as view direction, establish $\Sigma^{-1}$ relying on the specifics of $R$ and $S$, and precompute $\Sigma^{-1}(\boldsymbol{\mu} - \mathbf{o})$ for the current camera position $\mathbf{o}$, packing the 6 unique coefficients of $\Sigma^{-1}$ with the precomputed vector for efficient loading.

We found that activating “fast math” in combination with re-scheduling in Preprocess and Duplication may lead to a slightly different ordering of floating point instructions. Thus, there may be slight differences in the number of tiles contributed by each Gaussian. As we already store the number of tiles contributed by every Gaussian for memory allocation, we rely on the following simple solution: during Preprocess, we use a slightly lower threshold for culling, providing a slightly more conservative bound. During Duplication, we recheck whether the right number of tile contributions has been written. If this is not the case, we simply add a dummy entry that sets a higher tile id and sets the depth to $\infty$. For training, we suggest disabling “fast math”, ensuring that gradient computations are as stable as possible. For rendering, however, “fast math” may be beneficial to squeeze out even more performance.

For load balancing in both Preprocess and Duplication, we rely on the ballot instruction to determine which threads still require computations. We use shuffle operations to broadcast already loaded register values, so each thread can perform culling and depth evaluation without additional memory loads. We assign successive potential tiles to each thread according to their thread rank in the warp. For every iteration of the inner loop we again ballot to determine which threads in the warp still want to write to a tile, i.e. did not cull away their tile. We can then mask all ballot bits of lower ranked threads, compute their sum via popc and determine the write location for each thread.
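The ballot-and-popc step at the end of this paragraph can be emulated on the CPU. The sketch below is an illustration only: `ballot` and `popc` are hypothetical Python stand-ins for the corresponding warp-level GPU operations, and a list of booleans plays the role of the per-lane predicate "this thread did not cull its tile".

```python
WARP_SIZE = 32


def ballot(predicates):
    """Bit mask with bit i set if lane i's predicate is true."""
    mask = 0
    for lane, pred in enumerate(predicates):
        if pred:
            mask |= 1 << lane
    return mask


def popc(x):
    """Number of set bits (population count)."""
    return bin(x).count("1")


def warp_write_offsets(keep):
    """Per-lane write offsets for lanes that still want to write a tile."""
    assert len(keep) <= WARP_SIZE
    mask = ballot(keep)
    offsets = []
    for lane, wants_write in enumerate(keep):
        lower_lanes = mask & ((1 << lane) - 1)  # keep only lower-ranked lanes
        offsets.append(popc(lower_lanes) if wants_write else None)
    return offsets, popc(mask)  # offsets and total number of writes
```

Each writing lane obtains a unique, densely packed slot without any atomic operation, since the offset depends only on the shared ballot mask and the lane's own rank.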

Render

Our hierarchical rasterizer is constructed from many steps, which are interleaved in their operation. Due to the setup, there are special optimizations we can perform based on the current state of the pipeline: The pipeline starts out with an initialize phase for each level, establishing a minimal fill level for each where no merge sort is performed. In this phase, blending is not taking place either. During the main operation, we ensure that we maintain a minimal fill level for each queue. Finally, the pipeline is drained, and the number of elements in each queue eventually drops to zero. Furthermore, we know that certain parts of the pipeline will always be executed a specific number of times. The combination of these facts allows for a significant amount of specialization and loop unrolling. However, we found that excessive code specialization and unrolling leads to a significant number of stalls due to instruction fetches. Thus, relying on less specialized code is beneficial overall, although up to 15% more instructions are required for the increased control logic.

For Batcher Merge Sort, we use a trivial implementation adapted from the NVIDIA CUDA examples (https://github.com/NVIDIA/cuda-samples). For Merge Sort, we use a custom implementation that is adapted for our use case: each thread holds the to-be-inserted elements in registers and runs a binary search through the existing array to find where the new element should be placed with respect to the existing data. In combination with the thread’s rank, this yields the position in the final sorted array. Still, we need to update the position of the existing data. To this end, we switch the roles and memory locations of both data arrays and perform the exact same binary search, only switching strict comparison to non-strict comparison. Also note that we are operating on a small fixed size array, enabling loop unrolling and leading to very few memory accesses. For local presorting of four elements, we simply run three circular shuffles, revealing all elements among all threads to directly yield the right order via simple counting of smaller elements. In our tests this was faster than any other method.
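The rank-based merge and the counting presort described above can be sketched sequentially. This is an illustrative sketch, not the renderer's code: Python's `bisect` module stands in for the per-thread binary search, the loops stand in for threads, and breaking ties by element index in the presort is an assumption.

```python
from bisect import bisect_left, bisect_right


def rank_merge(existing, incoming):
    """Merge two sorted lists; every element computes its own final slot."""
    merged = [None] * (len(existing) + len(incoming))
    # New elements: own rank + number of existing keys strictly smaller.
    for rank, key in enumerate(incoming):
        merged[rank + bisect_left(existing, key)] = key
    # Roles switched: own rank + number of new keys smaller or equal.
    for rank, key in enumerate(existing):
        merged[rank + bisect_right(incoming, key)] = key
    return merged


def presort_by_counting(keys):
    """Sort a tiny batch by counting, per element, how many keys precede it."""
    out = [None] * len(keys)
    for i, key in enumerate(keys):
        smaller = sum(
            1 for j, other in enumerate(keys)
            if other < key or (other == key and j < i)  # index breaks ties
        )
        out[smaller] = key
    return out
```

Using a strict comparison in one direction and a non-strict one in the other guarantees that equal keys from the two arrays never claim the same slot.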

As we reevaluate $t_{opt}$ many times for many different ray directions, constructing and normalizing view rays can become a bottleneck. Precomputing all view directions a single thread will need throughout the hierarchy (two for the $4{\times}4$ tile-queue, one for the $2{\times}2$ tile-queue and one for the per-pixel queue) would result in significant register pressure. Fortunately, the same directions are needed by different threads and we can store the directions in shared memory and fetch them on demand, leading to significant performance improvements.

We need to take some care to ensure that threads do not diverge; in particular, we can only retire queues if all threads in the associated tile are done. Also note that the loaded batches remain in registers for a potentially long time: a 16-wide batch loaded by a half warp remains in registers while four 4-thread batches are loaded and potentially up to 16 elements are blended. However, when the 32-wide batch is loaded, no smaller batches are kept alive, somewhat reducing register pressure.

Appendix C Popping Detection Metric

For our popping detection metric, we use the RAFT (Teed and Deng, 2020) model pre-trained on SINTEL (Butler et al., 2012), which is publicly available. We also compute the optical flow separately for each method for a fair comparison. We follow Nguyen-Phuoc et al. (2022) with timesteps $t \in \{1, 7\}$ to measure short-range and long-range view-consistency, respectively. We provide an additional ablation study for different $t \in \{3, 5, 9\}$ in Tab. 8, with three camera paths for the Garden scene of Mip-NeRF 360 (Barron et al., 2022). As can be seen, the consistency error grows almost linearly with increasing $t$. Further, our method outperforms 3DGS for all timesteps.

Table 8. ꟻLIP$_t$ comparison for $t \in \{1, 3, 5, 7, 9\}$ for three camera paths for the Garden scene of Mip-NeRF 360 (Barron et al., 2022). As can be seen, our method outperforms 3DGS for each $t$, and ꟻLIP$_t$ scales almost linearly with increasing $t$.
Method ꟻLIP$_1$ ꟻLIP$_3$ ꟻLIP$_5$ ꟻLIP$_7$ ꟻLIP$_9$
3DGS 0.0080 0.0109 0.0134 0.0157 0.0180
Ours 0.0075 0.0080 0.0082 0.0085 0.0087
Per-Frame Results.

To gain more insight into our proposed popping detection metric, we additionally provide per-frame plots for a video of the Garden scene in Fig. 11. As can clearly be seen, there are significant peaks in ꟻLIP$_1$ for 3DGS, caused by popping. Our method, on the other hand, does not suffer from such issues. When analyzing the plot for ꟻLIP$_7$, 3DGS obtains significantly higher error rates: using $t = 7$ accumulates artifacts over several iterations, therefore more clearly indicating popping when averaged over the complete video sequence.

[Figure 11 panels: ꟻLIP$_1$, ꟻLIP$_7$]
Figure 11. Per-frame ꟻLIP$_t$ scores for $t \in \{1, 7\}$ for a complete video sequence from the Garden scene. Popping in 3DGS causes significant peaks, as can be seen in the results for ꟻLIP$_1$.
3DGS Cheating.

To support our claim that 3DGS indeed cheats with popping to produce view-dependent effects, we provide additional images in Fig. 12. We choose a ground-truth view from Train and Garden and sample a random rotation from $[-0.5^{\circ}, 0.5^{\circ}]^3$, which we apply to the ground-truth camera rotation. Subsequently, we compare the rendering from the ground-truth camera pose and the rendering from the slightly rotated pose for 3DGS, as well as our method.
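One way to generate such a perturbation is sketched below. The text only specifies the sampling range, so the Euler-angle convention (rotation about x, then y, then z), the left-multiplication onto the ground-truth rotation, and all function names are assumptions made for illustration.

```python
import math
import random


def axis_rotation(axis, deg):
    """3x3 rotation matrix about a coordinate axis, angle in degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    if axis == "x":
        return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    if axis == "y":
        return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def perturb_rotation(r_gt, max_deg=0.5, rng=random):
    """Apply a random rotation sampled from [-max_deg, max_deg]^3 to r_gt."""
    angles = [rng.uniform(-max_deg, max_deg) for _ in range(3)]
    delta = matmul3(axis_rotation("z", angles[2]),
                    matmul3(axis_rotation("y", angles[1]),
                            axis_rotation("x", angles[0])))
    return matmul3(delta, r_gt), angles
```

Passing a seeded `random.Random` instance as `rng` makes the perturbed pose reproducible across the compared methods.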

As can be seen, our approach produces more consistent results under view rotation. Due to 3DGS’s popping, the appearance changes significantly around test set views, which results in better image metrics in some configurations. In Fig. 12, we increase contrast for the zoomed-in views and provide ꟻLIP comparisons to more clearly illustrate view inconsistencies.

[Figure 12 panel labels: Ground Truth Images, Contrast ↑, ꟻLIP; Original, Rotated]
Figure 12. 3DGS can fake view-dependent effects with popping. We slightly rotate test set views, and 3DGS’s results are significantly less consistent compared to our results. We increase contrast for zoomed-in views and include a ꟻLIP view for a better comparison.

Appendix D User Study

For our user study, we recruited 18 participants from a local university, aged 26 to 34, all with normal or corrected vision and no color blindness. All participants indicated that they are familiar with computer graphics (3-5 on a 5-point Likert scale).

We pre-recorded camera paths for all 13 scenes, looking at the main object present in the scene. For 3DGS and ours, we used the version specifically trained for these approaches without Opacity Decay. The paths all exhibit translation and rotation. The recorded video clips were between 8 and 19 seconds long.

After a pre-questionnaire, we instructed the participants that they would be presented with video pairs, that they should specifically look for consistency in the rendering, and that they should then rate whether either of the video clips was more consistent than the other. If they did not consider any clip more consistent, they were allowed to rate them as equal. We mapped those answers onto scores $s$:

$s = \begin{cases} -1 & \text{3DGS is more consistent}, \\ \phantom{-}0 & \text{both are equal}, \\ \phantom{-}1 & \text{ours is more consistent}. \end{cases}$

We presented both videos side-by-side and played them in a loop. We did not restrict the answer times, allowing participants to watch the clips repeatedly. We randomized the order of videos (left, right) as well as the order of scenes.

Overall, participants considered our method more consistent in $54.3\%$ of the cases, voted for equal in $33.3\%$ and preferred 3DGS in $12.4\%$, leading to an average preference score of $s_{mean} = 0.42$. The result is statistically significant according to Wilcoxon Signed Rank tests ($Z = 2276.5$, $p < .0001$) (Woolson, 2008). As can be seen in Fig. 13, we observe inter-scene differences. For scenes with mostly small Gaussians, like in Bonsai or Kitchen, we expected less difference in the voted scores, as there is also less popping. In contrast, for scenes with large Gaussians, where popping occurs more often, like Room, Train or Truck, it is not surprising that our method is preferred by a large margin. We were not able to assess why participants slightly preferred 3DGS for Bicycle.
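The reported average preference score follows directly from the vote shares and the score mapping. A brief check, using only the percentages stated above (the function name is illustrative):

```python
def mean_preference(share_ours, share_equal, share_3dgs):
    """Average of the scores +1 (ours), 0 (equal) and -1 (3DGS)."""
    return (+1) * share_ours + 0 * share_equal + (-1) * share_3dgs


s_mean = mean_preference(0.543, 0.333, 0.124)
print(round(s_mean, 2))  # 0.42
```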

Figure 13. Average per-scene user study score. A positive score indicates a preference for our method, whereas a negative score indicates a preference for 3DGS. Our method clearly outperforms 3DGS.

Appendix E Detailed Performance Timings

In this section, we provide additional performance ablation studies. We follow the evaluation setup from the main material, interpolating between all available COLMAP poses (30 frames per pose), and rendering in Full HD on an NVIDIA RTX 4090 with CUDA 11.8.

Per-Scene Performance Timings.

In Tab. 9, we show per-scene performance timings for the total render time in ms. For most of the Mip-NeRF 360 (Barron et al., 2022) Indoor and Outdoor scenes, our method is slightly slower than 3DGS, with Bicycle and Room being the exceptions. For the Tanks & Temples (Knapitsch et al., 2017) and Deep Blending (Hedman et al., 2018) datasets, we achieve higher performance than 3DGS for most scenes. Analyzing the performance in more detail, we could verify that our method outperforms 3DGS when Gaussians are larger and/or more anisotropic, as our culling and load balancing can speed up rendering. If Gaussians are small and uniformly sized, the main load stems from the final stages of the render kernel, where sorting of course creates an overhead compared to 3DGS.

Table 9. Total performance timings for different configurations of our method and 3DGS, with the respective number of Gaussians per scene for comparison. Although scenes may exhibit a similar number of Gaussians, performance timings vary significantly.
Scene Bicycle Flowers Garden Stump Treehill
#Gaussians 5.95M 3.60M 5.49M 4.84M 3.85M
(A) Ours 6.829 4.921 7.247 4.693 5.012
(B) Ours w/o per-tile depth 6.730 4.693 7.160 4.509 4.879
(C) Ours w/o load balancing 8.482 6.732 9.167 6.496 6.919
(D) Ours w/o tile-based culling 10.066 7.338 9.796 6.584 7.884
(E) Ours w/o hier. culling 11.087 7.589 11.788 7.178 7.773
3DGS 7.438 4.002 6.034 3.708 4.492
Scene Bonsai Counter Kitchen Room
#Gaussians 1.25M 1.20M 1.81M 1.55M
(A) Ours 3.399 4.390 5.695 3.990
(B) Ours w/o per-tile depth 3.285 4.250 5.587 3.844
(C) Ours w/o load balancing 5.251 6.217 7.558 5.843
(D) Ours w/o tile-based culling 4.846 6.977 8.214 6.155
(E) Ours w/o hier. culling 4.608 6.142 8.916 5.450
3DGS 2.574 4.043 4.783 4.180
Dataset Deep Blending Tanks & Temples
Scene DrJohnson Playroom Train Truck
#Gaussians 3.28M 2.33M 1.05M 2.56M
(A) Ours 4.763 4.549 4.225 5.100
(B) Ours w/o per-tile depth 4.612 4.373 4.099 4.898
(C) Ours w/o load balancing 6.648 6.275 6.215 6.942
(D) Ours w/o tile-based culling 7.998 7.295 7.469 8.363
(E) Ours w/o hier. culling 6.418 5.999 5.675 7.113
3DGS 5.752 4.303 5.548 5.506
Relative Performance Timings.

In Tab. 10, we report per-stage performance timings of our method relative to 3DGS for each scene. Preprocess is generally slower due to the additional workload of tile-based culling and computation of $\Sigma^{-1}$. Due to our load balancing strategy, our Duplicate stage is faster for every tested scene except for Flowers, where our load balancing scheme is not able to amortize the additional workload of per-tile depth evaluations and tile-based culling. Sort is accelerated drastically, as the modifications for the previous stages result in fewer 2D splats to sort. The Render stage is naturally slower due to the overhead of our hierarchical rasterizer.

Table 10. Relative per-scene performance timings of our method with respect to 3DGS for each stage. Shades of blue indicate scenes where our method performed favorably, whereas shades of red indicate the opposite. We also report average percentages in the final row (not average runtime).
Dataset Scene #Gaussians Preprocess Duplicate Sort Render Total
M360 Outdoor Bicycle 5.95M 1.38 0.65 0.19 1.40 0.92
Flowers 3.60M 1.53 1.21 0.25 1.71 1.23
Garden 5.49M 1.27 0.78 0.29 2.04 1.20
Stump 4.84M 1.52 0.90 0.26 1.82 1.27
Treehill 3.85M 1.45 0.88 0.22 1.70 1.12
M360 Indoor Bonsai 1.25M 1.32 0.84 0.25 2.06 1.32
Counter 1.20M 1.46 0.75 0.17 1.80 1.09
Kitchen 1.81M 1.25 0.66 0.25 2.02 1.19
Room 1.55M 1.60 0.70 0.13 1.53 0.95
DB DrJohnson 3.28M 1.60 0.62 0.12 1.36 0.83
Playroom 2.33M 1.70 0.93 0.14 1.63 1.06
T&T Train 1.05M 1.42 0.61 0.11 1.53 0.76
Truck 2.56M 1.57 0.79 0.13 1.69 0.93
Average 2.98M 1.47 0.79 0.19 1.71 1.07

As previously discussed, Bonsai and Train exhibit the biggest inter-method differences despite the similar number of Gaussians due to differently sized splats in screen space. Bicycle is the only scene of Mip-NeRF 360 Outdoor where our approach outperforms 3DGS. Although Garden has a similarly high number of Gaussians, Bicycle allows for more aggressive culling and, therefore, has a significantly lower avg. $N_t$ for our method ($2.7$ vs. $2.41$) compared to 3DGS ($6.24$ vs. $4.45$).
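The Total column of Tab. 10 can be reproduced from the absolute timings in Tab. 9, rows "(A) Ours" and "3DGS". The short sketch below uses only those published numbers; the dictionary layout and variable names are illustrative. It also shows what the final row averages: the per-scene ratios, not the runtimes.

```python
# Total render time in ms from Tab. 9: (ours, 3DGS) per scene.
TIMINGS_MS = {
    "Bicycle": (6.829, 7.438), "Flowers": (4.921, 4.002),
    "Garden": (7.247, 6.034), "Stump": (4.693, 3.708),
    "Treehill": (5.012, 4.492), "Bonsai": (3.399, 2.574),
    "Counter": (4.390, 4.043), "Kitchen": (5.695, 4.783),
    "Room": (3.990, 4.180), "DrJohnson": (4.763, 5.752),
    "Playroom": (4.549, 4.303), "Train": (4.225, 5.548),
    "Truck": (5.100, 5.506),
}

# Relative timing per scene; values below 1 mean our method is faster.
relative = {scene: ours / gs for scene, (ours, gs) in TIMINGS_MS.items()}

# The final row of Tab. 10 averages these ratios.
average_ratio = sum(relative.values()) / len(relative)

faster_scenes = sorted(s for s, r in relative.items() if r < 1.0)
```

With these inputs, `faster_scenes` contains Bicycle, DrJohnson, Room, Train and Truck, matching the entries below 1 in the Total column.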

Hyperparameter Performance Ablation.

In Tab. 11, we compare timings for the Render stage for different size combinations of the $2{\times}2$ tile-queue and per-pixel queue. As expected, our final chosen values (8 and 4) perform best, since they have the lowest memory and compute requirements. Runtime increases significantly for very large per-pixel queue sizes, as register pressure increases and spills into local memory. In Tab. 11, we also show performance timings of the Preprocess and Duplicate stages for different load balancing tile thresholds ($\infty$ = no load balancing). Gaussians whose 2D splat covers more tiles than this threshold will be computed cooperatively by all threads of a warp. Our chosen value of 32 performs well for both stages. Preprocess, which only performs tile-based culling, does not show large differences and is even slightly faster for small thresholds. Duplicate shows significant improvements, as it performs many more computations per tile, i.e. tile-based culling, per-tile depth calculation, and key/value writes. Choosing a value of 32 also fits GPU warp sizes, and allows for better warp utilization, as well as fast concurrent memory writes.

Appendix F Per-Scene Quality Metrics

We provide per-scene results for Mip-NeRF 360 (Barron et al., 2022), Tanks and Temples (Knapitsch et al., 2017) and Deep Blending (Hedman et al., 2018) in Tabs. 12 and 13. Results with dagger (†) were reproduced from Kerbl et al. (2023): this includes results for Mip-NeRF 360 (Barron et al., 2022), Instant-NGP (Müller et al., 2022) and Plenoxels (Fridovich-Keil et al., 2022). We evaluate our final hierarchical rasterizer (“Ours”), as well as the fixed-size head sorting method for two different resorting window sizes (“Head 8” and “Head 16”), with and without per-tile depth (“w/o PTD”).

Table 11. Ablation of our method for different sizes of the $2{\times}2$ tile-queue and the per-pixel queue in addition to different load balancing thresholds, with the same underlying model (no retraining).
$2{\times}2$ tile-queue 8 8 12 12 20 20
per-pixel queue 4 8 4 8 8 16
Render 3.646 4.000 4.052 4.364 4.470 8.549
Threshold 1 4 16 32 64 256 $\infty$
Preprocess 0.644 0.643 0.648 0.648 0.649 0.652 0.847
Duplicate 0.607 0.541 0.437 0.435 0.447 0.515 2.059
Table 12. Combined per-scene scores for PSNR, SSIM, LPIPS and ꟻLIP for Tanks & Temples and Deep Blending.
Metric PSNR SSIM
Dataset Tanks & Temples Deep Blending Tanks & Temples Deep Blending
Scene Truck Train DrJ. Playroom Truck Train DrJ. Playroom
Mip-NeRF 360† 24.91 19.52 29.14 29.66 0.857 0.660 0.901 0.900
Instant-NGP (base)† 23.26 20.17 27.75 19.48 0.779 0.666 0.839 0.754
Instant-NGP (big)† 23.38 20.46 28.26 21.67 0.800 0.689 0.854 0.780
Plenoxels 23.23 18.94 23.16 23.02 0.774 0.663 0.787 0.802
3DGS 25.39 22.04 29.06 29.86 0.878 0.813 0.898 0.901
Head 8 w/o PTD 24.79 21.52 29.40 30.29 0.877 0.809 0.902 0.905
Head 8 24.81 21.41 29.51 30.31 0.878 0.810 0.904 0.905
Head 16 w/o PTD 24.84 21.60 29.40 30.36 0.878 0.810 0.904 0.905
Head 16 24.81 21.36 29.44 30.31 0.877 0.809 0.903 0.906
Ours w/o PTD 24.93 21.53 29.44 30.31 0.878 0.810 0.903 0.905
Ours 24.93 21.48 29.42 30.31 0.878 0.808 0.903 0.905
3DGS (Opacity Decay) 25.31 21.73 28.18 29.69 0.874 0.804 0.888 0.899
Ours (Opacity Decay) 24.90 21.46 29.38 30.30 0.875 0.804 0.903 0.907
LPIPS

F

LIP
Dataset Tanks & Temples Deep Blending Tanks & Temples Deep Blending
Scene Truck Train DrJ. Playroom Truck Train DrJ. Playroom
Mip-NeRF 360† 0.159 0.354 0.237 0.252 0.162 0.302 0.117 0.158
Instant-NGP (base)† 0.274 0.386 0.381 0.465 0.194 0.297 0.141 0.375
Instant-NGP (big)† 0.249 0.360 0.352 0.428 0.190 0.291 0.133 0.311
Plenoxels 0.308 0.379 0.433 0.418 0.196 0.328 0.222 0.266
3DGS 0.148 0.208 0.247 0.246 0.148 0.250 0.119 0.143
Head 8 w/o PTD 0.143 0.204 0.236 0.237 0.165 0.265 0.116 0.140
Head 8 0.142 0.203 0.234 0.235 0.166 0.266 0.114 0.139
Head 16 w/o PTD 0.142 0.203 0.234 0.236 0.166 0.262 0.115 0.138
Head 16 0.142 0.203 0.234 0.235 0.164 0.267 0.116 0.139
Ours w/o PTD 0.142 0.204 0.234 0.235 0.163 0.264 0.115 0.139
Ours 0.142 0.204 0.234 0.235 0.164 0.267 0.115 0.138
3DGS (Opacity Decay) 0.160 0.228 0.265 0.260 0.148 0.261 0.124 0.144
Ours (Opacity Decay) 0.151 0.218 0.241 0.241 0.160 0.267 0.115 0.138
Table 13. Combined per-scene scores for PSNR, SSIM, LPIPS and FLIP for the Mip-NeRF 360 dataset.
Dataset Mip-NeRF 360 Outdoor Mip-NeRF 360 Indoor
Scene Bicycle Flowers Garden Stump Treehill Room Counter Kitchen Bonsai
PSNR
Mip-NeRF 360† 24.30 21.65 26.88 26.36 22.93 31.47 29.45 31.99 33.40
Instant-NGP (base)† 22.19 20.35 24.60 23.63 22.36 29.27 26.44 28.55 30.34
Instant-NGP (big)† 22.17 20.65 25.07 23.47 22.37 29.69 26.69 29.48 30.69
Plenoxels 21.90 20.10 23.50 20.68 22.26 27.57 23.64 23.43 24.71
3DGS 25.18 21.48 27.24 26.62 22.45 31.49 28.98 31.35 32.10
Head 8 w/o PTD 25.18 21.49 27.14 26.64 22.41 30.77 28.83 31.06 31.85
Head 8 25.19 21.50 27.20 26.62 22.52 30.88 28.78 31.04 31.98
Head 16 w/o PTD 25.20 21.48 27.18 26.62 22.45 30.84 28.84 30.89 31.63
Head 16 25.22 21.55 27.12 26.59 22.50 30.81 28.78 31.06 31.88
Ours w/o PTD 25.21 21.45 27.17 26.68 22.47 30.84 28.70 31.23 31.90
Ours 25.20 21.50 27.16 26.69 22.43 30.83 28.59 31.13 31.93
3DGS (Opacity Decay) 24.93 21.30 27.05 26.57 22.39 31.03 28.64 31.07 31.52
Ours (Opacity Decay) 25.00 21.30 26.95 26.67 22.39 30.58 28.33 30.46 30.76
SSIM
Mip-NeRF 360† 0.685 0.584 0.809 0.745 0.631 0.910 0.892 0.917 0.938
Instant-NGP (base)† 0.491 0.450 0.649 0.574 0.518 0.855 0.798 0.818 0.890
Instant-NGP (big)† 0.512 0.486 0.701 0.594 0.542 0.871 0.817 0.858 0.906
Plenoxels 0.495 0.432 0.606 0.523 0.510 0.840 0.758 0.648 0.814
3DGS 0.763 0.603 0.862 0.772 0.632 0.917 0.906 0.925 0.939
Head 8 w/o PTD 0.766 0.602 0.862 0.773 0.633 0.917 0.905 0.925 0.939
Head 8 0.766 0.604 0.862 0.773 0.634 0.916 0.905 0.924 0.939
Head 16 w/o PTD 0.767 0.603 0.861 0.773 0.633 0.917 0.905 0.922 0.939
Head 16 0.767 0.604 0.861 0.773 0.635 0.917 0.905 0.925 0.939
Ours w/o PTD 0.767 0.603 0.862 0.775 0.635 0.917 0.904 0.925 0.939
Ours 0.767 0.604 0.862 0.775 0.635 0.917 0.903 0.925 0.939
3DGS (Opacity Decay) 0.749 0.592 0.854 0.770 0.626 0.914 0.899 0.921 0.937
Ours (Opacity Decay) 0.756 0.593 0.855 0.775 0.629 0.914 0.898 0.920 0.935
Dataset Mip-NeRF 360 Outdoor Mip-NeRF 360 Indoor
Scene Bicycle Flowers Garden Stump Treehill Room Counter Kitchen Bonsai
LPIPS
Mip-NeRF 360† 0.305 0.346 0.171 0.261 0.347 0.213 0.207 0.128 0.179
Instant-NGP (base)† 0.487 0.481 0.312 0.450 0.489 0.301 0.342 0.254 0.227
Instant-NGP (big)† 0.446 0.441 0.257 0.421 0.450 0.261 0.306 0.205 0.193
Plenoxels 0.490 0.506 0.374 0.468 0.495 0.344 0.378 0.404 0.336
3DGS 0.213 0.338 0.109 0.216 0.327 0.221 0.202 0.127 0.206
Head 8 w/o PTD 0.207 0.336 0.107 0.211 0.322 0.216 0.199 0.126 0.203
Head 8 0.207 0.335 0.107 0.211 0.320 0.217 0.199 0.126 0.202
Head 16 w/o PTD 0.206 0.336 0.107 0.211 0.321 0.216 0.198 0.128 0.203
Head 16 0.206 0.335 0.107 0.211 0.319 0.216 0.199 0.126 0.202
Ours w/o PTD 0.205 0.335 0.107 0.210 0.319 0.216 0.199 0.126 0.203
Ours 0.206 0.335 0.107 0.210 0.319 0.216 0.200 0.126 0.202
3DGS (Opacity Decay) 0.244 0.358 0.125 0.232 0.347 0.230 0.215 0.137 0.210
Ours (Opacity Decay) 0.232 0.354 0.122 0.224 0.336 0.224 0.211 0.135 0.207

FLIP
Mip-NeRF 360† 0.169 0.217 0.124 0.156 0.184 0.095 0.100 0.088 0.069
Instant-NGP (base)† 0.203 0.260 0.155 0.209 0.189 0.118 0.144 0.123 0.093
Instant-NGP (big)† 0.201 0.251 0.146 0.213 0.189 0.112 0.139 0.113 0.089
Plenoxels 0.211 0.271 0.181 0.276 0.206 0.143 0.201 0.218 0.165
3DGS 0.158 0.225 0.118 0.150 0.186 0.093 0.105 0.096 0.082
Head 8 w/o PTD 0.160 0.223 0.120 0.150 0.184 0.102 0.107 0.100 0.086
Head 8 0.159 0.223 0.119 0.150 0.181 0.101 0.108 0.099 0.083
Head 16 w/o PTD 0.159 0.224 0.119 0.151 0.182 0.101 0.107 0.103 0.086
Head 16 0.159 0.222 0.121 0.151 0.183 0.102 0.108 0.099 0.085
Ours w/o PTD 0.160 0.225 0.120 0.149 0.183 0.101 0.110 0.099 0.085
Ours 0.159 0.224 0.119 0.149 0.184 0.101 0.111 0.099 0.084
3DGS (Opacity Decay) 0.162 0.228 0.120 0.151 0.182 0.096 0.107 0.099 0.085
Ours (Opacity Decay) 0.162 0.228 0.122 0.148 0.182 0.103 0.112 0.106 0.090
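
As a reading aid for Table 13, the following minimal sketch averages the per-scene PSNR rows of “3DGS” and “Ours” and lists the per-scene differences. The values are copied from the table. The helper names (`mean_by_method`, `per_scene_delta`) are hypothetical and not part of the released renderer, and the unweighted mean over scenes is an assumption made here for illustration.

```python
from statistics import mean

# Scene order and PSNR values copied from the "3DGS" and "Ours" rows of Table 13.
SCENES = ["Bicycle", "Flowers", "Garden", "Stump", "Treehill",
          "Room", "Counter", "Kitchen", "Bonsai"]
PSNR = {
    "3DGS": [25.18, 21.48, 27.24, 26.62, 22.45, 31.49, 28.98, 31.35, 32.10],
    "Ours": [25.20, 21.50, 27.16, 26.69, 22.43, 30.83, 28.59, 31.13, 31.93],
}


def mean_by_method(scores):
    """Average each method's scores over all scenes (unweighted)."""
    return {method: mean(values) for method, values in scores.items()}


def per_scene_delta(scores, method, baseline):
    """Return {scene: method score minus baseline score}, rounded to 2 digits."""
    return {
        scene: round(a - b, 2)
        for scene, a, b in zip(SCENES, scores[method], scores[baseline])
    }


means = mean_by_method(PSNR)
deltas = per_scene_delta(PSNR, "Ours", "3DGS")

# Scene where "Ours" loses the most PSNR relative to "3DGS".
worst_scene = min(deltas, key=deltas.get)

for method, value in means.items():
    print(f"{method}: mean PSNR {value:.2f}")
print(f"largest drop: {worst_scene} ({deltas[worst_scene]:+.2f})")
```

For these nine scenes, the two averages differ by roughly 0.16, and the indoor scenes account for nearly all of the gap, while the outdoor scenes are within 0.1 of each other.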