SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Yi-Hua Huang1#*  Yang-Tian Sun1#*  Ziyi Yang3*  Xiaoyang Lyu1  Yan-Pei Cao2†  Xiaojuan Qi1†
1 The University of Hong Kong  2 VAST  3 Zhejiang University
Abstract

Novel view synthesis for dynamic scenes remains a challenging problem in computer vision and graphics. Recently, Gaussian splatting has emerged as a robust technique to represent static scenes and enable high-quality, real-time novel view synthesis. Building upon this technique, we propose a new representation that explicitly decomposes the motion and appearance of dynamic scenes into sparse control points and dense Gaussians, respectively. Our key idea is to use sparse control points, significantly fewer in number than the Gaussians, to learn compact 6 DoF transformation bases, which can be locally interpolated through learned interpolation weights to yield the motion field of 3D Gaussians. We employ a deformation MLP to predict time-varying 6 DoF transformations for each control point, which reduces learning complexity, enhances learning ability, and facilitates obtaining temporally and spatially coherent motion patterns. Then, we jointly learn the 3D Gaussians, the canonical space locations of control points, and the deformation MLP to reconstruct the appearance, geometry, and dynamics of 3D scenes. During learning, the location and number of control points are adaptively adjusted to accommodate varying motion complexities in different regions, and an ARAP loss following the as-rigid-as-possible principle is developed to enforce spatial continuity and local rigidity of learned motions. Finally, thanks to the explicit sparse motion representation and its decomposition from appearance, our method enables user-controlled motion editing while retaining high-fidelity appearances. Extensive experiments demonstrate that our approach outperforms existing approaches on novel view synthesis with a high rendering speed and enables novel appearance-preserved motion editing applications.

Figure 1: Given (a) an image sequence from a monocular dynamic video, we propose to represent the motion with a set of sparse control points, which can be used to drive 3D Gaussians for high-fidelity rendering. Our approach enables both (b) dynamic view synthesis and (c) motion editing due to the motion representation based on sparse control points.

#This work is in collaboration with VAST.
*Equal Contribution  †Corresponding Author
Project page: https://yihua7.github.io/SC-GS-web/.

1 Introduction

Novel view synthesis from a monocular video is a crucial problem with many applications in virtual reality, gaming, and the movie industry. However, extracting scene geometry and appearance from limited observations [30, 49, 31] is challenging. Moreover, real-world scenes often contain dynamic objects, which pose additional challenges in representing object movements accurately to reflect real-world dynamics [37, 19, 33, 34, 18]. Recent advancements in this area are primarily driven by neural radiance fields (NeRF) [30, 37, 19, 66], which utilize an implicit function to simultaneously learn scene geometry [29, 26] and textures [12, 57] from multi-view images. Despite significant progress, NeRF-based representations still struggle with low rendering speeds and high memory usage. This issue is particularly evident when rendering at high resolutions [62, 6, 31], as they necessitate sampling hundreds of query points along each ray to predict color and opacity.

Most recently, Gaussian splatting [13] has shown remarkable performance in terms of rendering quality, resolution, and speed. Utilizing a point-based [14, 53, 2, 67, 15, 10, 46] scene representation, this method rasterizes 3D Gaussians to render images from specified views. It enables fast model training and real-time inference, achieving state-of-the-art (SOTA) visual quality. However, its existing formulation only applies to static scenes. It remains a challenge to incorporate object motion into the Gaussian representation without compromising rendering quality and speed. An intuitive method is to learn a flow vector for each 3D Gaussian. However, it will incur a significant time cost for training and inference. Moreover, it also leads to noisy trajectories and poor generalization in novel views, as demonstrated in Fig. 6 (a).

Motivated by the observation that real-world motions are often sparse, spatially continuous, and locally rigid, we propose to drive 3D Gaussians with learnable sparse control points (≈512), far fewer than the Gaussians themselves (≈100K), so that scene dynamics are modeled in a much more compact space. These control points are associated with time-varying 6 DoF transformations, parameterized as a rotation quaternion and a translation vector, which can be locally interpolated through learned interpolation weights to yield the motion field of dense Gaussians. These 6 DoF parameters on control points are predicted by an MLP conditioned on time and location. Then, we jointly learn the canonical space 3D Gaussian parameters, the locations and radii of the sparse control points in canonical space, and the MLP for dynamic novel view synthesis. During learning, we introduce a strategy that adaptively changes the number of sparse points to accommodate motion complexities in different regions and employ an ARAP loss that encourages the learned motions to be locally rigid.

Owing to the effective motion and appearance representations, our approach simultaneously enables high-quality dynamic view synthesis and motion editing, as shown in Fig. 1. We perform extensive experiments and ablation studies on benchmark datasets, demonstrating that our model surpasses existing methods both quantitatively and qualitatively while maintaining high rendering speeds. Furthermore, by learning a control graph from the scene motion, our control points-based motion representation allows for convenient motion editing, a feature not present in existing methods [37, 5, 1, 38, 11]. More results for motion editing are included in Fig. 5 and the supplementary material. Our contributions can be summarized as follows:

  • We introduce sparse control points together with an MLP for modeling scene motion, based on the insight that motions within a scene can be represented by a compact subspace with a sparse set of bases.

  • We employ adaptive learning strategies and design a regularization loss based on rigid constraints to enable effective learning of appearances, geometry, and motion from a monocular video.

  • Thanks to the sparse motion representation, our approach enables motion editing by manipulating the learned control points while maintaining high-fidelity appearances.

  • Extensive experiments show our approach achieves SOTA performance quantitatively and qualitatively.

2 Related Work

Dynamic NeRF. Novel view synthesis has been a prominent topic in the academic field for several years. NeRF [30] models static scenes implicitly with MLPs, and many works [37, 19, 52, 45, 33, 34, 11, 66] have expanded its usage to dynamic scenes via a deformation field. Some methods [7, 18, 35] represent dynamic scenes as 4D radiance fields but face extensive computational costs due to ray point sampling and volume rendering. Several acceleration approaches have been used for dynamic scene modeling. DeVRF [25] introduces a grid representation, and IBR-based methods [23, 20, 22, 55] use multi-camera information for quality and efficiency. Other methods used primitives [27], predicted MLP maps [36], or grid/plane-based structures [40, 5, 1, 38, 47, 48] for speed and performance in various dynamic scenes. However, hybrid models underperform with high-rank dynamic scenes due to their low-rank assumption.

Dynamic Gaussian Splatting. Gaussian Splatting [13, 51] offers improved rendering quality and speed for radiance fields. Several concurrent works have adapted 3D Gaussians for dynamic scenes. Luiten et al. [28] utilize frame-by-frame training, which suits multi-view scenes. Yang et al. [58] separate scenes into 3D Gaussians and a deformation field for monocular scenes but face slow training due to an extra MLP for learning Gaussian offsets. Following [58], Wu et al. [50] replace the MLP with multi-resolution hex-planes [1] and a lightweight MLP. Yang et al. [59] include time as an additional feature in 4D Gaussians but face quality issues compared to methods that impose constraints in canonical space. Our work proposes using sparse control points to drive the deformation of 3D Gaussians, which enhances rendering quality and reduces MLP query overhead. The learned control point graph can also be used for motion editing.

3D Deformation and Editing. Traditional deformation methods in computer graphics are typically based on Laplacian coordinates [24, 43, 42, 41, 8], Poisson equation [63] and cage-based approaches [61, 69]. These methods primarily focus on preserving the geometric details of 3D objects during the deformation process. In recent years, there have been other approaches [64, 65, 54, 70, 21] that aim to edit the scene geometry learned from 2D images. These methods prioritize the rendering quality of the edited scene. Our approach falls into this category. However, instead of relying on the implicit and computationally expensive NeRF-based approach, our method employs an explicit point-based control graph deformation strategy and Gaussian rendering, which is more intuitive and efficient.

Figure 2: We present a novel method of employing sparse control points and a deformation MLP to direct 3D Gaussian dynamics. The MLP takes canonical control point coordinates and time as input and outputs per-control-point 6 DoF transformations, which guide 3D Gaussian deformation based on the K nearest control points. The transformed Gaussians are then rendered into images, and a rendering loss is computed and backpropagated to optimize the Gaussians, control points, and MLP. Gaussian and control point densities are adaptively managed during training.

3 Preliminaries

Gaussian splatting represents a 3D scene using colored 3D Gaussians [13]. Each Gaussian $G$ has a 3D center location $\mu$ and a 3D covariance matrix $\Sigma$,

$$G(x) = e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}. \tag{1}$$

The covariance matrix $\Sigma$ is decomposed as $\Sigma = RSS^{T}R^{T}$ for optimization, with $R$ as a rotation matrix represented by a quaternion $q \in \mathbf{SO}(3)$, and $S$ as a scaling matrix represented by a 3D vector $s$. Each Gaussian has an opacity value $\sigma$ to adjust its influence in rendering and is associated with spherical harmonic (SH) coefficients $sh$ for view-dependent appearance. A scene is parameterized as a set of Gaussians $\mathcal{G} = \{G_j : \mu_j, q_j, s_j, \sigma_j, sh_j\}$.
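The decomposition above can be illustrated with a minimal NumPy sketch. It assumes quaternions are stored in (w, x, y, z) order and that $S$ is the diagonal matrix built from $s$; the function names `quat_to_rotmat`, `covariance`, and `gaussian` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a valid rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(q, s):
    """Build Sigma = R S S^T R^T from quaternion q and per-axis scales s."""
    R = quat_to_rotmat(np.asarray(q, dtype=float))
    S = np.diag(np.asarray(s, dtype=float))  # assumption: S is diagonal
    return R @ S @ S.T @ R.T

def gaussian(x, mu, sigma):
    """Evaluate the unnormalized Gaussian of Eq. (1) at point x."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve Sigma y = d instead of forming the explicit inverse.
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)))
```

Parameterizing $\Sigma$ through $q$ and $s$ keeps it symmetric and positive semi-definite by construction, which is presumably why the factored form is the one that gets optimized.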

Rendering an image involves projecting these Gaussians onto the 2D image plane and aggregating them using fast $\alpha$-blending. The 2D covariance matrix and center are $\Sigma' = JW\Sigma W^{T}J^{T}$ and $\mu' = JW\mu$, where $W$ is the viewing transformation and $J$ is the Jacobian of the affine approximation of the projective transformation, following [13]. The color $C(u)$ of a pixel $u$ is rendered using a neural point-based $\alpha$-blending as,

$$C(u) = \sum_{i \in N} T_i \alpha_i \mathcal{SH}(sh_i, v_i), \text{ where } T_i = \prod_{j=1}^{i-1}(1-\alpha_j). \tag{2}$$

Here, $\mathcal{SH}$ is the spherical harmonic function and $v_i$ is the view direction. $\alpha_i$ is calculated by evaluating the corresponding projected Gaussian $G_i$ at pixel $u$ as,

$$\alpha_i = \sigma_i e^{-\frac{1}{2}(u-\mu_i')^{T}{\Sigma_i'}^{-1}(u-\mu_i')}, \tag{3}$$

where $\mu_i'$ and $\Sigma_i'$ are the projected center point and covariance matrix of Gaussian $G_i$, respectively. By optimizing the Gaussian parameters $\{G_j : \mu_j, q_j, s_j, \sigma_j, sh_j\}$ and adjusting Gaussian density adaptively, high-quality images can be synthesized in real time. We further introduce sparse control points to adapt Gaussian splatting for dynamic scenes while maintaining rendering quality and speed.
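As a rough illustration of Eqs. (2) and (3), the sketch below composites depth-sorted contributions for a single pixel. It makes several assumptions: the Gaussians are already sorted front to back, the exponent uses the inverse of the projected covariance as in Eq. (1), and each entry of `colors` stands in for the evaluated $\mathcal{SH}(sh_i, v_i)$ term. The names `alpha_at_pixel` and `composite` are hypothetical, and a real rasterizer would run this per tile on the GPU rather than in a Python loop.

```python
import numpy as np

def alpha_at_pixel(u, mu2d, cov2d, opacity):
    """Opacity-weighted 2D Gaussian evaluated at pixel u (cf. Eq. 3)."""
    d = np.asarray(u, dtype=float) - np.asarray(mu2d, dtype=float)
    return opacity * float(np.exp(-0.5 * d @ np.linalg.solve(cov2d, d)))

def composite(alphas, colors):
    """Front-to-back alpha blending of depth-sorted contributions (cf. Eq. 2)."""
    pixel = np.zeros(3)
    transmittance = 1.0  # T_i: product of (1 - alpha_j) over all closer Gaussians
    for a, c in zip(alphas, colors):
        pixel += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a
    return pixel
```

For example, two half-transparent layers, red in front of green, blend to `composite([0.5, 0.5], [[1, 0, 0], [0, 1, 0]])`, which gives `[0.5, 0.25, 0.0]`: the second layer only receives the half of the light the first one lets through.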

4 Method

Our goal is to reconstruct a dynamic scene from a monocular video. We represent the geometry and appearance of the dynamic scene using Gaussians in the canonical space while modeling the motion through a set of control points together with time-varying 6 DoF transformations predicted by an MLP. These learned control points and corresponding transformations can be utilized to drive the deformation of Gaussians across different timesteps. The number of control points is significantly smaller than that of Gaussians, resulting in a set of compact motion bases for modeling scene dynamics and further facilitating motion editing. An overview of our method is shown in Fig. 2. In the following, we first present the sparse control points for representing compact motion bases in Sec. 4.1, followed by the dynamic scene rendering formulation in Sec. 4.2 and optimization process in Sec. 4.3.

4.1 Sparse Control Points

To derive a compact motion representation, we introduce a set of sparse control points $\mathcal{P} = \{(p_i \in \mathbb{R}^{3}, o_i \in \mathbb{R}^{+})\}, i \in \{1, 2, \cdots, N_p\}$. Here, $p_i$ denotes the learnable coordinate of control point $i$ in the canonical space. $o_i$ is the learnable radius parameter of a radial-basis-function (RBF) kernel that controls how the impact of a control point on a Gaussian decreases as their distance increases. $N_p$ is the total number of control points, which is considerably smaller than the number of Gaussians.

For each control point $i$, we learn time-varying 6 DoF transformations $[R_i^t | T_i^t] \in \mathbf{SE}(3)$, consisting of a local frame rotation matrix $R_i^t \in \mathbf{SO}(3)$ and a translation vector $T_i^t \in \mathbb{R}^{3}$. Instead of directly optimizing the transformation parameters for each control point at different time steps, we employ an MLP $\Psi$ to learn a time-varying transformation field and query the transformation of each control point $p_i$ at each timestep $t$ as:

$$\Psi : (p_i, t) \rightarrow (R_i^t, T_i^t). \tag{4}$$

Note that in practical implementations, $R_i^t$ is represented equivalently as a quaternion $r_i^t$ for more stable optimization and convenient interpolation for generating the motions of Gaussians in the follow-up steps.

4.2 Dynamic Scene Rendering

Equipped with the time-varying transformation parameters $(R_i^t, T_i^t)$ for sparse control points, which form a set of compact motion bases, the next step is to determine the transformation of each Gaussian at different time steps to derive the motion of the entire scene. We derive the dense motion field of Gaussians using linear blend skinning (LBS) [44] by locally interpolating the transformations of their neighboring control points. Specifically, for each Gaussian $G_j : (\mu_j, q_j, s_j, \sigma_j, sh_j)$, we use k-nearest neighbor (KNN) search to obtain its $K (= 4)$ neighboring control points, denoted as $\{p_k | k \in \mathcal{N}_j\}$, in canonical space. Then, the interpolation weights for control point $p_k$ can be computed with Gaussian-kernel RBF [9, 4, 32] as:

$$w_{jk} = \frac{\hat{w}_{jk}}{\sum_{k \in \mathcal{N}_j} \hat{w}_{jk}}, \text{ where } \hat{w}_{jk} = \exp\left(-\frac{d_{jk}^{2}}{2 o_k^{2}}\right), \tag{5}$$

where $d_{jk}$ is the distance between the center of Gaussian $G_j$ and the neighboring control point $p_k$, and $o_k$ is the learned radius parameter of $p_k$. During training, these interpolation weights are adaptable to model complex motions by encouraging the learnable radius parameters to be optimized in a way that can accurately reconstruct the video frames.
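The neighbor search and the weights of Eq. (5) can be sketched for a single Gaussian as follows. This is a minimal NumPy illustration, not the authors' implementation: it uses a brute-force nearest-neighbor search, and the function name `rbf_weights` and its argument layout are assumptions made here.

```python
import numpy as np

def rbf_weights(mu, ctrl_pts, radii, k=4):
    """Return indices and normalized RBF weights of the k nearest control points.

    mu: (3,) Gaussian center, ctrl_pts: (Np, 3) canonical control point positions,
    radii: (Np,) per-control-point radius parameters o_k.
    """
    d2 = np.sum((ctrl_pts - mu) ** 2, axis=1)          # squared distances d_jk^2
    idx = np.argsort(d2)[:k]                           # brute-force KNN
    w_hat = np.exp(-d2[idx] / (2.0 * radii[idx] ** 2)) # unnormalized weights
    return idx, w_hat / w_hat.sum()                    # weights sum to 1 over the neighbors
```

Because each control point carries its own radius, a point with a large $o_k$ can outweigh a closer neighbor with a small one, so the normalized weights do not have to decrease monotonically with distance.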

Using the interpolation weights of neighboring control points, we can calculate a Gaussian motion field through interpolation. Following dynamic fusion works [32, 17, 4], we employ LBS [44] to compute the warped Gaussian center $\mu_j^t$ and rotation $q_j^t$ as in Eq. (6) and Eq. (7) for simplicity and efficiency:

$$\mu_j^t = \sum_{k \in \mathcal{N}_j} w_{jk} \left( R_k^t (\mu_j - p_k) + p_k + T_k^t \right), \tag{6}$$

$$q_j^t = \Big( \sum_{k \in \mathcal{N}_j} w_{jk} r_k^t \Big) \otimes q_j, \tag{7}$$

where $R_k^t \in \mathbb{R}^{3 \times 3}$ and $r_k^t \in \mathbb{R}^{4}$ are the matrix and quaternion representations of the predicted rotation on control point $k$, respectively. $\otimes$ is the quaternion product, which yields the quaternion of the composition of the corresponding rotations. Then, with the updated Gaussian parameters, we are able to perform rendering at time step $t$ following Eq. (2) and Eq. (3).
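Eqs. (6) and (7) can be written out for one Gaussian and its $K$ neighbors as in the sketch below. The (w, x, y, z) quaternion order, the function names `quat_mul` and `warp_gaussian`, and the renormalization of the blended quaternion are assumptions of this illustration; the text does not state whether the blended quaternion is renormalized.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions a and b, both stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def warp_gaussian(mu, q, w, p, R, r, T):
    """Warp one Gaussian with its K neighboring control points.

    mu: (3,) center, q: (4,) rotation, w: (K,) interpolation weights,
    p: (K, 3) control points, R: (K, 3, 3) rotations,
    r: (K, 4) the same rotations as quaternions, T: (K, 3) translations.
    """
    # Eq. (6): rotate about each control point, translate, then blend.
    moved = np.einsum("kab,kb->ka", R, mu - p) + p + T
    mu_t = w @ moved
    # Eq. (7): blend the quaternions linearly, then compose with q.
    r_blend = w @ r
    r_blend = r_blend / np.linalg.norm(r_blend)  # assumption: keep a unit quaternion
    return mu_t, quat_mul(r_blend, q)
```

Rotating $\mu_j - p_k$ rather than $\mu_j$ itself makes each control point act as a local pivot, so a Gaussian attached to a rotating part follows it instead of rotating about the world origin.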

4.3 Optimization

Our dynamic scene representation consists of control points $\mathcal{P}$ and Gaussians $\mathcal{G}$ in the canonical space and the deformation MLP $\Psi$. To stabilize the training process, we first pre-train $\mathcal{P}$ and $\Psi$ to model the coarse scene motion with the Gaussians $\mathcal{G}$ fixed. The details are included in the supplementary material. Then, the whole model is optimized jointly. To facilitate learning, we introduce an ARAP loss to encourage the learned motion of control points to be locally rigid and employ an adaptive density adjustment strategy to adapt to varying motion complexities in different areas.

ARAP Loss and Overall Optimization Objective. To avoid local minima and regularize the unstructured control points, we introduce an ARAP loss $\mathcal{L}_{\text{arap}}$ that encourages their motions to be locally rigid, following the principle of being as rigid as possible [41]. Before computing the ARAP loss for control points, it is necessary to identify the edges that connect them. To avoid linking unrelated points, we opt to connect the points that have closely aligned trajectories in the scene motion. Specifically, for a control point $p_i$, we first calculate its trajectory $p_i^{\text{traj}}$ that includes its locations across $N_t (= 8)$ randomly sampled time steps as:

$$p_i^{\text{traj}} = \frac{1}{N_t} p_i^{t_1} \oplus p_i^{t_2} \oplus \cdots \oplus p_i^{t_{N_t}}, \tag{8}$$

where $\oplus$ denotes the vector concatenation operation. Based on the trajectories obtained, we perform ball queries and use all control points $\mathcal{N}_{c_i}$ within a pre-defined radius to define a local area. Then, to calculate $\mathcal{L}_{\text{arap}}$, we randomly sample two time steps $t_1$ and $t_2$. For each point $p_k$ within the radius (i.e. $k \in \mathcal{N}_{c_i}$), its transformed locations with learned translation parameters $T_k^{t_1}$ and $T_k^{t_2}$ are $p_k^{t_1} = p_k + T_k^{t_1}$ and $p_k^{t_2} = p_k + T_k^{t_2}$; thus the rotation matrix $\hat{R}_i$ can be estimated following a rigid motion assumption [41] as:

$$\hat{R}_i = \operatorname*{arg\,min}_{R \in \mathbf{SO}(3)} \sum_{k \in \mathcal{N}_{c_i}} w_{ik} \left\| (p_i^{t_1} - p_k^{t_1}) - R (p_i^{t_2} - p_k^{t_2}) \right\|^2. \tag{9}$$

Here $w_{ik}$ is calculated similarly to $w_{jk}$ in Eq. (5) by replacing the Gaussian position $\mu_j$ with the control point position $p_i$, and it weights the contribution of each neighboring point $p_k$ according to its impact on $p_i$. Eq. (9) can be solved in closed form using the singular value decomposition (SVD) according to [41]. Then, $\mathcal{L}_{\text{arap}}$ is designed as,

$$\mathcal{L}_{\text{arap}}(p_i, t_1, t_2) = \sum_{k \in \mathcal{N}_{c_i}} w_{ik} \left\| (p_i^{t_1} - p_k^{t_1}) - \hat{R}_i (p_i^{t_2} - p_k^{t_2}) \right\|^2, \tag{10}$$

which evaluates the degree to which the learned motion deviates from the assumption of local rigidity. By penalizing $\mathcal{L}_{\text{arap}}$, the learned motions are encouraged to be locally rigid. The rigid regularization significantly enhances the learned motion, as visualized in Fig. 6.
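
To make Eq. (9) and Eq. (10) concrete, the following is a minimal NumPy sketch for a single control point. It is an illustration under stated assumptions, not the authors' implementation: the function names `estimate_rotation` and `arap_loss` are hypothetical, the neighbor indices and weights $w_{ik}$ are assumed to be precomputed, and the rotation is obtained with the standard SVD-based solution of the weighted orthogonal Procrustes problem.

```python
import numpy as np

def estimate_rotation(d1, d2, w):
    """Closed-form minimizer of sum_k w_k * ||d1_k - R d2_k||^2 over rotations R (Eq. 9).

    d1, d2: (K, 3) edge vectors p_i - p_k at time steps t1 and t2.
    w:      (K,) non-negative weights w_ik.
    """
    # Weighted cross-covariance S = sum_k w_k * d2_k d1_k^T.
    S = (w[:, None] * d2).T @ d1
    U, _, Vt = np.linalg.svd(S)
    # Flip the last axis if needed so that det(R) = +1 (rotation, not reflection).
    sign = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T

def arap_loss(p_t1, p_t2, i, neighbors, w):
    """Eq. (10) for control point i, given positions at t1 and t2 with shape (N, 3)."""
    d1 = p_t1[i] - p_t1[neighbors]
    d2 = p_t2[i] - p_t2[neighbors]
    R = estimate_rotation(d1, d2, w)
    residual = d1 - d2 @ R.T  # row k holds d1_k - R d2_k
    return float(np.sum(w * np.sum(residual ** 2, axis=1)))
```

If the neighborhood moves rigidly between $t_1$ and $t_2$, the residuals vanish and the loss is zero up to floating-point error; any non-rigid distortion of the neighborhood yields a positive value.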

For optimization, besides $\mathcal{L}_{\text{arap}}$, the rendering loss $\mathcal{L}_{\text{render}}$ is derived by comparing the rendered images at different time steps with the ground-truth reference images using a combination of the $\mathcal{L}_1$ loss and the D-SSIM loss following [13]. Finally, the overall loss is constructed as $\mathcal{L} = \mathcal{L}_{\text{render}} + \mathcal{L}_{\text{arap}}$.
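
For reference, the combination used in [13] can be written out as follows. The weight $\lambda = 0.2$ is the value reported in [13]; that this work keeps the same value is an assumption here, since the text above only states that the combination follows [13].

```latex
\mathcal{L}_{\text{render}} = (1 - \lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}},
\qquad
\mathcal{L} = \mathcal{L}_{\text{render}} + \mathcal{L}_{\text{arap}}
```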

Table 1: Quantitative comparison on D-NeRF [37] datasets. We present the average PSNR/SSIM/LPIPS (VGG) values for novel view synthesis on dynamic scenes from D-NeRF, with each cell colored to indicate the best, second best, and third best.
Methods  Hook: PSNR(↑) SSIM(↑) LPIPS(↓)  JumpingJacks: PSNR(↑) SSIM(↑) LPIPS(↓)  Trex: PSNR(↑) SSIM(↑) LPIPS(↓)  BouncingBalls: PSNR(↑) SSIM(↑) LPIPS(↓)
D-NeRF [37] 29.25 .968 .1120 32.80 .981 .0381 31.75 .974 .0367 38.93 .987 .1074
TiNeuVox-B [5] 31.45 .971 .0569 34.23 .986 .0383 32.70 .987 .0340 40.73 .991 .0472
Tensor4D [40] 29.03 .955 .0499 24.01 .919 .0768 23.51 .934 .0640 25.36 .961 .0411
K-Planes [38] 28.59 .953 .0581 32.27 .971 .0389 31.41 .980 .0234 40.61 .991 .0297
FF-NVS [11] 32.29 .980 .0400 33.55 .980 .0300 30.71 .960 .0400 40.02 .990 .0400
4D-GS [50] 30.99 .990 .0248 33.59 .990 .0242 32.16 .988 .0216 38.59 .993 .0267
Baseline 34.47 .990 .0195 35.74 .992 .0178 36.37 .994 .0103 41.45 .996 .0190
Ours 39.87 .997 .0076 41.13 .998 .0067 41.24 .998 .0046 44.91 .998 .0166
Methods  Hellwarrior: PSNR(↑) SSIM(↑) LPIPS(↓)  Mutant: PSNR(↑) SSIM(↑) LPIPS(↓)  Standup: PSNR(↑) SSIM(↑) LPIPS(↓)  Average: PSNR(↑) SSIM(↑) LPIPS(↓)
D-NeRF [37] 25.02 .955 .0633 31.29 .978 .0212 32.79 .983 .0241 31.69 .975 .0575
TiNeuVox-B [5] 28.17 .978 .0706 33.61 .982 .0388 35.43 .991 .0230 33.76 .983 .0441
Tensor4D [40] 31.40 .925 .0675 29.99 .951 .0422 30.86 .964 .0214 27.62 .947 .0471
K-Planes [38] 25.27 .948 .0775 33.79 .982 .0207 34.31 .984 .0194 32.32 .973 .0382
FF-NVS [11] 27.71 .970 .0500 34.97 .980 .0300 36.91 .990 .0200 33.73 .979 .0357
4D-GS [50] 31.39 .974 .0436 35.98 .996 .0120 35.37 .994 .0136 34.01 .987 .0316
Baseline 39.07 .982 .0350 41.45 .998 .0045 41.04 .996 .0071 38.51 .992 .0162
Ours 42.93 .994 .0155 45.19 .999 .0028 47.89 .999 .0023 43.31 .997 .0063

Adaptive Control Points. Following [13], we also develop an adaptive density adjustment strategy to add and prune control points, which adjusts their distribution to model varying motion complexities; e.g., areas that exhibit complex motion patterns typically require a higher density of control points. 1) To determine whether a control point $p_i$ should be pruned, we calculate its overall impact $W_i = \sum_{j \in \tilde{\mathcal{N}}_i} w_{ji}$ on the set of Gaussians $j \in \tilde{\mathcal{N}}_i$ whose $K$ nearest neighbors include $p_i$. Then, we prune $p_i$ if $W_i$ is close to zero, indicating little contribution to the motion of 3D Gaussians. 2) To determine whether a control point $p_k$ should be cloned, we calculate the weighted sum of the squared gradient norms of the Gaussians in the set $\tilde{\mathcal{N}}_k$ as:

$$g_k = \sum_{j \in \tilde{\mathcal{N}}_k} \tilde{w}_j \left\| \frac{d\mathcal{L}}{d\mu_j} \right\|_2^2, \text{ where } \tilde{w}_j = \frac{w_{jk}}{\sum_{j' \in \tilde{\mathcal{N}}_k} w_{j'k}}. \tag{11}$$

A large $g_k$ indicates poor reconstruction. Therefore, we clone $p_k$ and add a new control point $p'_k$ at the expected position of the related Gaussians to improve the reconstruction:

$$p'_k = \sum_{j \in \tilde{\mathcal{N}}_k} \tilde{w}_j \mu_j; \quad \sigma'_k = \sigma_k. \tag{12}$$
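
The pruning and cloning criteria can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `adapt_control_points`, the dense weight matrix layout, and the thresholds `prune_eps` and `clone_thresh` are hypothetical and do not come from the text, which does not give numeric thresholds.

```python
import numpy as np

def adapt_control_points(w, mu, grad_mu, prune_eps=1e-3, clone_thresh=1.0):
    """Sketch of the prune/clone decisions for control points.

    w:       (N_g, N_p) weights; w[j, k] is zero when control point k is not
             among the K nearest neighbors of Gaussian j.
    mu:      (N_g, 3) Gaussian centers.
    grad_mu: (N_g, 3) loss gradients with respect to the Gaussian centers.
    Returns (prune_mask, clone_mask, new_positions).
    """
    W = w.sum(axis=0)                       # overall impact of each control point
    prune_mask = W < prune_eps              # little influence on any Gaussian
    w_tilde = w / np.maximum(W, 1e-12)      # normalize weights per control point
    grad_sq = np.sum(grad_mu ** 2, axis=1)  # squared gradient norm per Gaussian
    g = w_tilde.T @ grad_sq                 # weighted gradient score (Eq. 11)
    clone_mask = (g > clone_thresh) & ~prune_mask
    expected_pos = w_tilde.T @ mu           # weighted mean of Gaussian centers (Eq. 12)
    # A cloned point would also copy the radius of its parent (sigma'_k = sigma_k).
    return prune_mask, clone_mask, expected_pos[clone_mask]
```

In this sketch a control point that influences no Gaussian is pruned, and a control point whose associated Gaussians carry large positional gradients spawns a new point at their weighted mean position.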

5 Motion Editing

Since our approach utilizes an explicit and sparse motion representation, it further allows for efficient and intuitive motion editing through the manipulation of control points. It is achieved by predicting the trajectory of each control point across different time steps, determining their neighborhoods, constructing a rigid control graph, and performing motion editing by graph deformation.

Control Point Graph. With the trained control points $\mathcal{P}$ and the MLP $\Psi$, we construct a control point graph $\mathcal{G}$ that connects control points based on their trajectories. For each vertex of the graph, i.e., control point $p_i$, we first calculate its trajectory $p_i^{\text{traj}}$ derived from Eq. (8). Then, the vertex is connected with the other vertices that fall within a ball of a pre-determined radius around $p_i^{\text{traj}}$. The edge weight $w_{ij}$ between two connected vertices $p_i$ and $p_j$ is calculated using Eq. (5). Building the control graph from point trajectories takes the overall motion sequence into account instead of a single timestep, which avoids unreasonable edge connections. We demonstrate the advantage of this choice in the supplementary material.
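
A minimal sketch of the trajectory-based graph construction is given below. Several details are assumptions made for illustration: the function name `build_control_graph` is hypothetical, the trajectory descriptor is taken to be the concatenation of per-timestep positions scaled by $1/N_t$, and the edge weights use a normalized Gaussian kernel with a single shared `sigma` as a stand-in for Eq. (5).

```python
import numpy as np

def build_control_graph(traj, radius, sigma):
    """Connect control points whose trajectories stay close to each other.

    traj:   (N_p, N_t, 3) control-point positions over N_t time steps.
    radius: ball-query radius in trajectory space.
    sigma:  kernel width for the edge weights (assumed shared by all points).
    Returns a dict mapping i -> (neighbor indices, normalized edge weights).
    """
    n_p, n_t, _ = traj.shape
    # Concatenate positions over time into one descriptor per control point.
    desc = traj.reshape(n_p, -1) / n_t
    dist = np.linalg.norm(desc[:, None, :] - desc[None, :, :], axis=-1)
    graph = {}
    for i in range(n_p):
        nbrs = np.flatnonzero((dist[i] < radius) & (np.arange(n_p) != i))
        w = np.exp(-dist[i, nbrs] ** 2 / (2.0 * sigma ** 2))
        if w.size:
            w = w / w.sum()  # normalize weights over the neighborhood
        graph[i] = (nbrs, w)
    return graph
```

Two points that are adjacent at one time step but move apart later end up far from each other in trajectory space, so they are not connected, which is the behavior the paragraph above motivates.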

Motion Editing. To maintain local rigidity, we perform ARAP [41] deformation on the control graph based on constraints specified by users. Mathematically, given a set of user-defined handle points $\{h_l \in \mathbb{R}^3 \mid l \in \mathcal{H} \subset \{1, 2, \cdots, N_p\}\}$, the control graph $\mathcal{P}'$ can be deformed by minimizing the ARAP energy formulated as:

$$E(\mathcal{P}') = \sum_{i=1}^{N_p} \sum_{j \in \mathcal{N}_i} w_{ij} \left\| (p'_i - p'_j) - \hat{R}_i (p_i - p_j) \right\|^2, \tag{13}$$

with the fixed position condition $p'_l = h_l$ for $l \in \mathcal{H}$. Here $\hat{R}_i$ is the rigid local rotation defined on each control point. This optimization problem can be efficiently solved by alternately optimizing the local rotations $\hat{R}_i$ and the deformed control point positions $p'$. We refer the readers to [41] for the specific optimization process. The solved rotation $\hat{R}_i$ and translation $\hat{T}_i = p'_i - p_i$ form a 6 DoF transformation for each control point, which is consistent with our motion representation. Thus, Gaussians can be warped by the deformed control points by simply replacing the transformation in Eq. (6) and Eq. (7), and can then be rendered into high-quality edited images even for motion outside the training sequence. We visualize the motion editing results in Fig. 5.
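
The alternating scheme can be sketched as a local/global iteration on Eq. (13). The code below is a simplified illustration rather than the procedure of [41] in full: the function name `arap_deform` is hypothetical, the edge weights are assumed to be stored in a dense symmetric matrix `W` with a zero diagonal, the graph is assumed connected, and at least one handle must be given so that the linear system is well posed.

```python
import numpy as np

def arap_deform(p, W, handles, n_iters=20):
    """Minimize the ARAP energy by alternating global and local steps.

    p:       (N, 3) rest positions of the control points.
    W:       (N, N) symmetric edge weights, zero for non-edges and on the diagonal.
    handles: dict {index: target position}, must be non-empty.
    Returns deformed positions q (N, 3) and local rotations R (N, 3, 3).
    """
    p = np.asarray(p, dtype=float)
    n = len(p)
    L = np.diag(W.sum(axis=1)) - W                 # weighted graph Laplacian
    fixed = sorted(handles)
    free = np.setdiff1d(np.arange(n), fixed)
    q = p.copy()
    q[fixed] = np.array([handles[l] for l in fixed])
    R = np.tile(np.eye(3), (n, 1, 1))
    for _ in range(n_iters):
        # Global step: rotations fixed, solve a linear system for the free points.
        b = np.zeros((n, 3))
        for i in range(n):
            for j in np.flatnonzero(W[i]):
                b[i] += 0.5 * W[i, j] * (R[i] + R[j]) @ (p[i] - p[j])
        rhs = b[free] - L[np.ix_(free, fixed)] @ q[fixed]
        q[free] = np.linalg.solve(L[np.ix_(free, free)], rhs)
        # Local step: positions fixed, fit the best rotation per control point via SVD.
        for i in range(n):
            j = np.flatnonzero(W[i])
            S = (W[i, j][:, None] * (p[i] - p[j])).T @ (q[i] - q[j])
            U, _, Vt = np.linalg.svd(S)
            sign = np.sign(np.linalg.det(Vt.T @ U.T))
            R[i] = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    return q, R
```

The returned `q - p` and `R` correspond to the per-control-point translation $\hat{T}_i$ and rotation $\hat{R}_i$ described above. A dense matrix keeps the sketch short; a sparse Laplacian with a cached factorization would be the natural choice for larger graphs.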

Table 2: Quantitative comparison on NeRF-DS [56] datasets. We display the average PSNR/MS-SSIM/LPIPS (Alex) metrics for novel view synthesis on dynamic scenes from NeRF-DS, with each cell colored to indicate the best, second best, and third best.
Methods  Bell: PSNR(↑) MS-SSIM(↑) LPIPS(↓)  Sheet: PSNR(↑) MS-SSIM(↑) LPIPS(↓)  Press: PSNR(↑) MS-SSIM(↑) LPIPS(↓)  Basin: PSNR(↑) MS-SSIM(↑) LPIPS(↓)
HyperNeRF [34] 24.0 .884 .159 24.3 .874 .148 25.4 .873 .164 20.2 .829 .168
NeRF-DS [56] 23.3 .872 .134 25.7 .918 .115 26.4 .911 .123 20.3 .868 .127
TiNeuVox-B [5] 23.1 .876 .113 21.1 .745 .234 24.1 .892 .133 20.7 .896 .105
Baseline 24.9 .917 .124 26.1 .903 .127 25.1 .884 .221 19.6 .852 .144
Ours 25.1 .918 .117 26.2 .898 .142 26.6 .901 .135 19.6 .846 .154
Methods  Cup: PSNR(↑) MS-SSIM(↑) LPIPS(↓)  Sieve: PSNR(↑) MS-SSIM(↑) LPIPS(↓)  Plate: PSNR(↑) MS-SSIM(↑) LPIPS(↓)  Average: PSNR(↑) MS-SSIM(↑) LPIPS(↓)
HyperNeRF [34] 20.5 .705 .318 25.0 .909 .129 18.1 .714 .359 22.5 .827 .206
NeRF-DS [56] 24.5 .916 .118 26.1 .935 .108 20.8 .867 .164 23.9 .898 .127
TiNeuVox-B [5] 20.5 .806 .182 20.1 .822 .205 20.6 .863 .161 21.5 .843 .162
Baseline 24.7 .919 .116 25.3 .917 .109 20.3 .842 .214 23.7 .891 .151
Ours 24.5 .916 .115 26.0 .919 .114 20.2 .837 .202 24.1 .891 .140

6 Experiment

6.1 Datasets and Evaluation Metrics

To validate our method, we conduct extensive experiments on the D-NeRF [37] and NeRF-DS [56] datasets. The D-NeRF datasets contain eight dynamic scenes with $360^{\circ}$ viewpoint settings, and the NeRF-DS datasets consist of seven captured videos with camera poses estimated using COLMAP [39]. The two datasets involve a variety of rigid and non-rigid deformations of various objects. The metrics we use to evaluate the performance are Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), Multiscale SSIM (MS-SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) [68].
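
As a reminder of how the simplest of these metrics is defined, PSNR can be computed as below. This is a generic sketch, not the evaluation script used for the tables: the function name `psnr` is hypothetical, and images are assumed to be floating-point arrays with values in $[0, 1]$.

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between an image and its reference."""
    img = np.asarray(img, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    mse = np.mean((img - ref) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of 10 dB corresponds to a tenfold reduction of that error.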

Figure 3: Qualitative comparison of dynamic view synthesis on D-NeRF [37] datasets. We compare our method with state-of-the-art methods including D-NeRF [37], TiNeuVox-B [5], K-Planes [38], and 4D-GS [59]. Our method delivers a higher visual quality and preserves more details of dynamic scenes. Notably, in the Lego scene (bottom row), the train motion is inconsistent with the test motion.
Figure 4: Qualitative comparisons of dynamic view synthesis on scenes from NeRF-DS [56]. Our method produces high-fidelity results even without specialized design for specular surfaces.

6.2 Quantitative Comparisons

D-NeRF Datasets. We compare our method against existing state-of-the-art methods, namely D-NeRF [37], TiNeuVox [5], Tensor4D [40], K-Planes [38], and FF-NVS [11], using their official implementations and following the same data setting. The concurrent work 4D-GS [50] is also compared since its official code has been released. We also evaluate a baseline that directly estimates per-Gaussian transformations with a deformation MLP to demonstrate the effectiveness of control points. The comparisons are carried out at a resolution of 400×400, following previous methods [37, 5, 1]. We report the comparison results in Tab. 1. Our approach achieves the best PSNR, SSIM, and LPIPS on every scene, with an average PSNR of 43.31 versus 38.51 for the baseline and 34.01 for 4D-GS, the prior method with the highest average PSNR. The baseline also achieves high synthesis quality thanks to the 3D Gaussian representation. However, without the regularization of compact motion bases, the baseline has difficulty reaching the global optimum. We also report the rendering speed comparison in the supplementary material to show the efficiency of our method.

NeRF-DS Datasets. Although the datasets provide relatively accurate camera poses compared with [34], some estimation errors inevitably remain, which degrades the performance of our method. Nevertheless, our approach still achieves the highest average PSNR among the compared methods (24.1 versus 23.9 for NeRF-DS), as reported in Tab. 2. It is worth mentioning that NeRF-DS outperforms both our method and the baseline on certain scenes and retains the best average MS-SSIM and LPIPS, as it employs a specialized design for modeling the specular parts of dynamic objects. Despite this, our approach, which does not employ any additional processing, still achieves a higher average PSNR.

[Figure 5 panels. Top row: Reconstructed Motion. Bottom row: Edited Motion.]
Figure 5: We visualize the reconstructed motion sequence from the dynamic scene (top) and edited motion sequence (bottom). Our approach generalizes well for motion out of the training set benefitting from the locally rigid motion space modeled by control points.

6.3 Qualitative Comparison

We also conduct qualitative comparisons to illustrate the advantages of our method over SOTA methods. The comparisons on the D-NeRF datasets are shown in Fig. 3, where zoomed-in crops show the details of the synthesized images. Our approach produces the results closest to the ground truth and has the best visual quality. Note that, in the Lego scene, the motion in the test set does not align with that in the training set, as indicated in the bottom row of the figure. The same observation is also reported in [58]. The qualitative comparisons on the NeRF-DS dataset are shown in Fig. 4. Our method produces high-fidelity novel views even in the absence of a specialized design for specular surfaces.

Table 3: We quantitatively evaluate the effect of control points and ARAP loss on D-NeRF [37] datasets.
Methods PSNR(↑) SSIM(↑) LPIPS(↓)
w/o Control Points 38.512 0.9922 0.0162
w/o ARAP loss 42.617 0.9963 0.0067
Full 43.307 0.9976 0.0063

6.4 Ablation study

Control Points. Our motion representation driven by control points constructs a compact and sparse motion space, effectively mitigating overfitting on the training set. We quantitatively compare the novel view synthesis quality of our method with the baseline that does not utilize control points on both D-NeRF [37] and NeRF-DS [56] datasets, as presented in Tab. 1 and Tab. 2. To intuitively elucidate the effects of control points, we compare the results and visualize the trajectories of Gaussians driven either with or without control points in Fig. 6 (a) and (b). Clearly, directly predicting the motion of each Gaussian with an MLP leads to noise in Gaussian trajectories. While the baseline theoretically is more flexible in representing diverse motions, it tends to falter and descend into local minima during optimization, hindering it from achieving the global optimum.

ARAP Loss. Although the control-point-driven motion representation provides effective regularization of Gaussian motions, there can be occasional breaches in rigidity. As shown in Fig. 6 (c), even though the Gaussians follow relatively smooth trajectories, some Gaussians on the arm move towards the girl’s torso instead of moving along with the ascending arm. This issue arises from the lack of constraints on the inter-relation of the control points’ motions. Imposing the ARAP loss on control points eliminates such artifacts and thus facilitates a robust motion reconstruction. Tab. 3 shows that, without the ARAP loss, the performance of dynamic view synthesis on D-NeRF [37] decreases slightly (PSNR drops from 43.307 to 42.617).

Figure 6: We visualize the rendering results and Gaussian trajectories of (a) the baseline method without control points, (b) our full method, and (c) our method without ARAP loss.

6.5 Motion Editing

Our method facilitates scene motion editing via the manipulation of control points, owing to its explicit motion representation. The learned correlation and weights between Gaussians and control points enable excellent generalization, even to motion beyond the training sequence. The reconstructed and edited motion sequences are shown in Fig. 5.

7 Conclusion and Future Works

We present a method that drives 3D Gaussians using control points and a deformation MLP, both learnable from dynamic scenes. Our approach, combining a compact motion representation with adaptive learning strategies and rigid constraints, allows high-quality dynamic scene reconstruction and motion editing. Experiments show that our method outperforms existing approaches in the visual quality of synthesized dynamic novel views. However, limitations exist. The performance is sensitive to inaccurate camera poses, leading to reconstruction failures on datasets with inaccurate poses such as HyperNeRF [34]. The current approach also has difficulty handling common specular effects, resulting in limited improvement on the NeRF-DS [56] datasets. To address this, future work could extend the method by incorporating Spec-Gaussian [60] with a specialized specular design, which would enable more accurate modeling of highlight and mirror effects. Furthermore, blurriness in videos with dynamic objects should be considered; incorporating deblurring techniques [3, 16] for novel view synthesis could make the proposed method more robust to it.

Acknowledgement

This work has been supported by Hong Kong Research Grant Council - Early Career Scheme (Grant No. 27209621), General Research Fund Scheme (Grant No. 17202422), and RGC Matching Fund Scheme (RMGS). Part of the described research work is conducted in the JC STEM Lab of Robotics for Soft Materials funded by The Hong Kong Jockey Club Charities Trust.

References

  • Cao and Johnson [2023] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In CVPR, 2023.
  • Dai et al. [2020] Peng Dai, Yinda Zhang, Zhuwen Li, Shuaicheng Liu, and Bing Zeng. Neural point cloud rendering via multi-plane projection. In CVPR, 2020.
  • Dai et al. [2023] Peng Dai, Yinda Zhang, Xin Yu, Xiaoyang Lyu, and Xiaojuan Qi. Hybrid neural rendering for large-scale scenes with motion blur. In CVPR, 2023.
  • Dou et al. [2016] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM TOG, 35(4):1–13, 2016.
  • Fang et al. [2022] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In ACM SIGGRAPH ASIA, 2022.
  • Fridovich-Keil et al. [2022] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
  • Gao et al. [2021] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In ICCV, 2021.
  • Gao et al. [2019] Lin Gao, Yu-Kun Lai, Jie Yang, Ling-Xiao Zhang, Shihong Xia, and Leif Kobbelt. Sparse data driven mesh deformation. IEEE TVCG, 27(3):2085–2100, 2019.
  • Gao and Tedrake [2018] Wei Gao and Russ Tedrake. Surfelwarp: Efficient non-volumetric single view dynamic reconstruction. Robotics: Science and Systems XIV, 2018.
  • Gao et al. [2023] Yiming Gao, Yan-Pei Cao, and Ying Shan. Surfelnerf: Neural surfel radiance fields for online photorealistic reconstruction of indoor scenes. In CVPR, 2023.
  • Guo et al. [2023] Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, and Jingdong Wang. Forward flow for novel view synthesis of dynamic scenes. In ICCV, 2023.
  • Huang et al. [2023] Yi-Hua Huang, Yan-Pei Cao, Yu-Kun Lai, Ying Shan, and Lin Gao. Nerf-texture: Texture synthesis with neural radiance fields. In ACM SIGGRAPH, pages 1–10, 2023.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM TOG, 42(4):1–14, 2023.
  • Keselman and Hebert [2022] Leonid Keselman and Martial Hebert. Approximate differentiable rendering with algebraic surfaces. In ECCV, 2022.
  • Keselman and Hebert [2023] Leonid Keselman and Martial Hebert. Flexible techniques for differentiable rendering with 3d gaussians. arXiv preprint arXiv:2308.14737, 2023.
  • Lee et al. [2024] Byeonghyeon Lee, Howoong Lee, Xiangyu Sun, Usman Ali, and Eunbyung Park. Deblurring 3d gaussian splatting. arXiv preprint arXiv:2401.00834, 2024.
  • Li et al. [2009] Hao Li, Bart Adams, Leonidas J Guibas, and Mark Pauly. Robust single-view geometry and motion reconstruction. ACM TOG, 28(5):1–10, 2009.
  • Li et al. [2022] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In CVPR, 2022.
  • Li et al. [2021] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In CVPR, 2021.
  • Li et al. [2023] Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. Dynibar: Neural dynamic image-based rendering. In CVPR, 2023.
  • Lin et al. [2023a] Gao Lin, Liu Feng-Lin, Chen Shu-Yu, Jiang Kaiwen, Li Chunpeng, Yukun Lai, and Fu Hongbo. Sketchfacenerf: Sketch-based facial generation and editing in neural radiance fields. ACM TOG, 2023a.
  • Lin et al. [2022] Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In ACM SIGGRAPH ASIA, pages 1–9, 2022.
  • Lin et al. [2023b] Haotong Lin, Sida Peng, Zhen Xu, Tao Xie, Xingyi He, Hujun Bao, and Xiaowei Zhou. High-fidelity and real-time novel view synthesis for dynamic scenes. In ACM SIGGRAPH ASIA, pages 1–9, 2023b.
  • Lipman et al. [2005] Yaron Lipman, Olga Sorkine-Hornung, Marc Alexa, Daniel Cohen-Or, David Levin, Christian Rössl, and Hans-Peter Seidel. Laplacian framework for interactive mesh editing. Int. J. Shape Model., 11:43–62, 2005.
  • Liu et al. [2022] Jia-Wei Liu, Yan-Pei Cao, Weijia Mao, Wenqiao Zhang, David Junhao Zhang, Jussi Keppo, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Devrf: Fast deformable voxel radiance fields for dynamic scenes. In NeurIPS, 2022.
  • Liu et al. [2023] Yu-Tao Liu, Li Wang, Jie Yang, Weikai Chen, Xiaoxu Meng, Bo Yang, and Lin Gao. Neudf: Leaning neural unsigned distance fields with volume rendering. In CVPR, 2023.
  • Lombardi et al. [2021] Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. Mixture of volumetric primitives for efficient neural rendering. ACM TOG, 40(4):1–13, 2021.
  • Luiten et al. [2024] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In 3DV, 2024.
  • Lyu et al. [2023] Xiaoyang Lyu, Peng Dai, Zizhang Li, Dongyu Yan, Yi Lin, Yifan Peng, and Xiaojuan Qi. Learning a room with the occ-sdf hybrid: Signed distance function mingled with occupancy aids scene representation. In ICCV, 2023.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM TOG, 41(4):1–15, 2022.
  • Newcombe et al. [2015] Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In CVPR, 2015.
  • Park et al. [2021a] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In ICCV, 2021a.
  • Park et al. [2021b] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM TOG, 2021b.
  • Park et al. [2023] Sungheon Park, Minjung Son, Seokhwan Jang, Young Chun Ahn, Ji-Yeon Kim, and Nahyup Kang. Temporal interpolation is all you need for dynamic neural radiance fields. In CVPR, 2023.
  • Peng et al. [2023] Sida Peng, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Representing volumetric videos as dynamic mlp maps. In CVPR, 2023.
  • Pumarola et al. [2021] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In CVPR, 2021.
  • Sara Fridovich-Keil and Giacomo Meanti et al. [2023] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In CVPR, 2023.
  • Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
  • Shao et al. [2023] Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In CVPR, 2023.
  • Sorkine and Alexa [2007] Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry Processing, pages 109–116. Citeseer, 2007.
  • Sorkine-Hornung [2005] Olga Sorkine-Hornung. Laplacian mesh processing. In Eurographics, 2005.
  • Sorkine-Hornung et al. [2004] Olga Sorkine-Hornung, Daniel Cohen-Or, Yaron Lipman, Marc Alexa, Christian Rössl, and Hans-Peter Seidel. Laplacian surface editing. In Eurographics Symposium on Geometry Processing, 2004.
  • Sumner et al. [2007] Robert W Sumner, Johannes Schmid, and Mark Pauly. Embedded deformation for shape manipulation. In ACM SIGGRAPH, pages 80–es. 2007.
  • Tretschk et al. [2021] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In ICCV, 2021.
  • Wang et al. [2023a] Cong Wang, Di Kang, Yan-Pei Cao, Linchao Bao, Ying Shan, and Song-Hai Zhang. Neural point-based volumetric avatar: Surface-guided neural points for efficient and photorealistic volumetric head avatar. In ACM SIGGRAPH ASIA, pages 1–12, 2023a.
  • Wang et al. [2023b] Feng Wang, Sinan Tan, Xinghang Li, Zeyue Tian, Yafei Song, and Huaping Liu. Mixed neural voxels for fast multi-view video synthesis. In ICCV, 2023b.
  • Wang et al. [2023c] Liao Wang, Qiang Hu, Qihan He, Ziyu Wang, Jingyi Yu, Tinne Tuytelaars, Lan Xu, and Minye Wu. Neural residual radiance fields for streamably free-viewpoint videos. In CVPR, 2023c.
  • Wang et al. [2021] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 34, 2021.
  • Wu et al. [2023] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
  • Wu et al. [2024] Tong Wu, Yu-Jie Yuan, Ling-Xiao Zhang, Jie Yang, Yan-Pei Cao, Ling-Qi Yan, and Lin Gao. Recent advances in 3d gaussian splatting. arXiv preprint arXiv:2403.11134, 2024.
  • Xian et al. [2021] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In CVPR, 2021.
  • Xu et al. [2022] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In CVPR, 2022.
  • Xu and Harada [2022] Tianhan Xu and Tatsuya Harada. Deforming radiance fields with cages. In ECCV, 2022.
  • Xu et al. [2023] Zhen Xu, Sida Peng, Haotong Lin, Guangzhao He, Jiaming Sun, Yujun Shen, Hujun Bao, and Xiaowei Zhou. 4k4d: Real-time 4d view synthesis at 4k resolution. arXiv preprint arXiv:2310.11448, 2023.
  • Yan et al. [2023] Zhiwen Yan, Chen Li, and Gim Hee Lee. Nerf-ds: Neural radiance fields for dynamic specular objects. In CVPR, 2023.
  • Yang et al. [2023a] Ziyi Yang, Yanzhen Chen, Xinyu Gao, Yazhen Yuan, Yu Wu, Xiaowei Zhou, and Xiaogang Jin. Sire-ir: Inverse rendering for brdf reconstruction with shadow and illumination removal in high-illuminance scenes. arXiv preprint arXiv:2310.13030, 2023a.
  • Yang et al. [2023b] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023b.
  • Yang et al. [2023c] Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642, 2023c.
  • Yang et al. [2024] Ziyi Yang, Xinyu Gao, Yangtian Sun, Yihua Huang, Xiaoyang Lyu, Wen Zhou, Shaohui Jiao, Xiaojuan Qi, and Xiaogang Jin. Spec-gaussian: Anisotropic view-dependent appearance for 3d gaussian splatting. arXiv preprint arXiv:2402.15870, 2024.
  • Yifan et al. [2020] Wang Yifan, Noam Aigerman, Vladimir G. Kim, Siddhartha Chaudhuri, and Olga Sorkine-Hornung. Neural cages for detail-preserving 3D deformations. In CVPR, 2020.
  • Yu et al. [2021] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In ICCV, 2021.
  • Yu et al. [2004] Yizhou Yu, Kun Zhou, Dong Xu, Xiaohan Shi, Hujun Bao, Baining Guo, and Heung-Yeung Shum. Mesh editing with Poisson-based gradient field manipulation. In ACM SIGGRAPH, pages 644–651. 2004.
  • Yuan et al. [2022] Yu-Jie Yuan, Yang-Tian Sun, Yu-Kun Lai, Yuewen Ma, Rongfei Jia, and Lin Gao. Nerf-editing: Geometry editing of neural radiance fields. In CVPR, 2022.
  • Yuan et al. [2023] Yu-Jie Yuan, Yang-Tian Sun, Yu-Kun Lai, Yuewen Ma, Rongfei Jia, Leif Kobbelt, and Lin Gao. Interactive nerf geometry editing with shape priors. IEEE TPAMI, 2023.
  • Yunus et al. [2024] Raza Yunus, Jan Eric Lenssen, Michael Niemeyer, Yiyi Liao, Christian Rupprecht, Christian Theobalt, Gerard Pons-Moll, Jia-Bin Huang, Vladislav Golyanik, and Eddy Ilg. Recent trends in 3d reconstruction of general non-rigid scenes. In Comput. Graph. Forum. Blackwell-Wiley, 2024.
  • Zhang et al. [2022] Qiang Zhang, Seung-Hwan Baek, Szymon Rusinkiewicz, and Felix Heide. Differentiable point-based radiance fields for efficient view synthesis. In ACM SIGGRAPH ASIA, pages 1–12, 2022.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • Zhang et al. [2020] Yuzhe Zhang, Jianmin Zheng, and Yiyu Cai. Proxy-driven free-form deformation by topology-adjustable control lattice. Computers & Graphics, 89:167–177, 2020.
  • Zheng et al. [2023] Chengwei Zheng, Wenbin Lin, and Feng Xu. Editablenerf: Editing topologically varying neural radiance fields by key points. In CVPR, 2023.