EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

Daiwei Zhang1  Gengyan Li1,2  Jiajie Li1  Mickaël Bressieux1
Otmar Hilliges1  Marc Pollefeys1,3  Luc Van Gool1,4,5  Xi Wang1
1ETH Zürich  2Google  3Microsoft  4KU Leuven  5INSAIT, Sofia
Abstract

Human activities are inherently complex, and even simple household tasks involve numerous object interactions. To better understand these activities and behaviors, it is crucial to model their dynamic interactions with the environment. The recent availability of affordable head-mounted cameras and egocentric data offers a more accessible and efficient means to understand dynamic human-object interactions in 3D environments. However, most existing methods for human activity modeling either focus on reconstructing 3D models of hand-object or human-scene interactions or on mapping 3D scenes, neglecting dynamic interactions with objects. The few existing solutions often require inputs from multiple sources, including multi-camera setups, depth-sensing cameras, or kinesthetic sensors. To this end, we introduce EgoGaussian, the first method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We leverage the uniquely discrete nature of Gaussian Splatting and segment dynamic interactions from the background. Our approach employs a clip-level online learning pipeline that leverages the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion. Additionally, our method automatically segments object and background Gaussians, providing 3D representations for both static scenes and dynamic objects. EgoGaussian outperforms previous NeRF and Dynamic Gaussian methods on challenging in-the-wild videos, and we also qualitatively demonstrate the high quality of the reconstructed models.

1 Introduction

Human activities are inherently complex and performing simple household tasks involves numerous interactions with objects. For example, making a coffee in the morning involves multiple steps: taking a mug from a shelf, placing it under the coffee machine, pressing a button for the preferred type of coffee, and adding milk or sugar. Even this seemingly simple task includes various object interactions and movements. To better understand human activities and behaviors, it is important to be able to model these dynamic interactions with the environment. The recent availability of affordable head-mounted cameras [44, 53] and egocentric data [11, 20, 21, 38] offers a more accessible and efficient means to understand dynamic human-object interactions in 3D environments. Toward this goal, we tackle the challenging task of reconstructing 3D scenes and dynamic interactions of objects from RGB egocentric videos.

Most existing methods for modeling human-object interactions either focus on reconstructing 3D hand-object [16, 33, 61, 69] or human-scene interaction models [26, 1, 28, 71, 32, 76] or on mapping 3D scenes [56]. These approaches often neglect dynamic interactions with objects, resulting in static representations with motion-induced artifacts, commonly known as the “ghost effect”. The few existing solutions often require inputs from multiple sources, including multi-camera setups [37], depth-sensing cameras [62], or kinesthetic sensors [25]. While these methods achieve 3D reconstruction, they do not consider changes caused by interactions and thus fail to capture the dynamics depicted in egocentric videos.

In this paper, we go beyond prior works to tackle the task of dynamic scene reconstruction from RGB egocentric videos. Our proposed method EgoGaussian simultaneously reconstructs 3D scenes and dynamically tracks 3D object motions within them. Our key insight is that the uniquely discrete nature of Gaussian Splatting makes it especially suitable for spatial segmentation, allowing objects to be trained separately from the background. Given that human activities involve continuous motion over time, we identify critical contact points in time and distinguish dynamic interactions from static captures that only contain camera movements. We propose a clip-level online learning pipeline that leverages the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion.

To reconstruct the dynamic scenes from an egocentric video, EgoGaussian first obtains hand-object segmentation using an off-the-shelf method and derives camera poses through structure-from-motion. By leveraging the natural trajectories of interactions, we partition the input video into static and dynamic clips. The static clips are used to reconstruct the background scenes and initialize the shapes of the object that will be interacted with. Subsequently, we refine the object’s shapes and track their motion through the dynamic clips. We empirically show that EgoGaussian achieves better reconstruction of dynamic scenes than previous NeRF and Dynamic Gaussian methods. We quantitatively evaluate our method on two in-the-wild egocentric video datasets following the evaluation protocol for novel-view synthesis. We also qualitatively demonstrate the high quality of the reconstructed scenes and the tracked object shapes and their motion.

Our main contributions can be summarized as follows: 1) We present a novel method that accurately reconstructs 3D scenes and dynamic object motion within them from RGB egocentric videos. 2) We leverage the dynamic nature of interactions that consist of transitions between static and dynamic phases, which facilitates the reconstruction of the static scenes, the object shapes, and the tracking of their motion. 3) Through both qualitative and quantitative evaluation, we demonstrate that our method outperforms previous approaches and provides better 4D reconstruction that captures the dynamic object interactions. Our model and code will be released upon acceptance.

2 Related Work

Hand-Object Segmentation. Many works have studied hand-object interaction in egocentric vision from different aspects. One significant area of focus is segmentation, specifically obtaining image segmentation masks of hands and the objects they hold. Ren et al. [49] proposed a motion-based approach to robustly segment both hand and object using optical flow and domain-specific cues from egocentric video. Concurrent with the emergence of hand-object segmentation based on deep neural networks is the scaling-up of egocentric data that includes pixel-level annotations and involves diverse daily activities [11, 12, 20]. VISOR [13] annotates videos from the EPIC-KITCHENS [11, 12] dataset and provides masks for 67k hand-object relations covering 36 hours of videos. EgoHOS [74] further introduces the notion of a dense contact boundary to explicitly model the interaction and a context-aware compositional data augmentation technique to generate semantically consistent hand-object segmentation on out-of-distribution egocentric videos. Cheng et al. [9] produce a rich, unified 2D output of interaction by converting predicted bounding boxes to segments with Segment Anything (SAM) [30]. Our method takes egocentric videos with hand-object segmentation masks as input and creates dynamic 3D models.

Hand-Object Reconstruction. Another highly related direction is to reconstruct the hand-object interaction, featuring 3D pose estimation for hands and objects. Recent works often jointly reconstruct hands and objects to favor physically plausible interactions [16, 33, 61, 69, 70, 78]. These approaches can be grouped into two categories. The first assumes a known 3D object model and fits that model to the 2D image [6, 10, 33, 55, 67]. For example, RHO [6] adapts an optimization-based approach that can reconstruct hands and objects from single images in the wild by leveraging 2D image cues and 3D contact priors as constraints. The second, more recent category eliminates the need for a known 3D model and directly reconstructs 3D object shapes from the input [16, 48, 70]. However, these works either require multiview input [48] or specific hand-object interaction supervision [70], or can only reconstruct simple object shapes [16]. Current shape-agnostic methods struggle in in-the-wild scenarios [16, 78]. In contrast, our method does not require prior knowledge and obtains 3D object shapes through differentiable 3D Gaussian-based rendering.

Static Scene Modeling. In the past few years, the domain of static scene modeling has garnered considerable attention. Mildenhall et al. [42] introduce the groundbreaking Neural Radiance Fields (NeRF), which utilizes a large Multilayer Perceptron (MLP) to represent 3D scenes and renders them via volume rendering. However, their method queries the MLP at hundreds of points for each ray, resulting in slow training and rendering. Additionally, the original NeRF’s performance can diminish in scenes with highly dynamic elements due to its static, volumetric nature. Therefore, some subsequent works have tried to enhance the quality by (1) mitigating existing problems such as aliasing [3, 4, 8] and reflection [24, 58], (2) incorporating image processing [41, 39], (3) employing per-image transient latent codes [40, 51], and (4) introducing supervision of expected depth with point clouds [14, 65]. Other follow-up works aim to improve the speed, for example, by caching precomputed MLP results [27, 72], employing well-designed data structures [7, 54], removing the neural network [18], or utilizing multi-resolution hash encoding [43]. Yet, most of these methods still use ray marching, which involves sampling millions of points and makes real-time rendering difficult. Recently, Kerbl et al. [29] propose a different approach to the modeling and rendering of complex static 3D scenes: 3D Gaussian Splatting (3DGS). They model static scenes with Gaussians whose position, opacity, shape, and color are learned through a differentiable splatting-based renderer, achieving real-time rendering speed.

Dynamic Scene Modeling. Motivated by the success of NeRF [42] in static scene modeling, numerous studies have adopted neural representations to model dynamic scenes. One strategy to extend 3D into 4D scenes is to use timestamps as an additional conditioning factor [2, 64]. Another set of “dynamic NeRF” works [5, 17, 50, 52, 59] employs 4D space-time grid-based representations. Representing the 3D scene at a certain timestamp as a canonical space and then explicitly modeling deformation fields to warp 3D points into the canonical space is a common strategy as well [15, 34, 45, 47]. Another strategy is to combine the two approaches, using a conditional neural volume together with a deformation field [31, 46]. However, these methods all suffer from the same issues as static NeRFs: they require ray marching and, despite advances in performance, are still not sufficiently fast for real-time rendering. Similarly, many dynamic extensions to 3D Gaussian Splatting have been proposed [36, 37, 63, 68]. The most common approach is to learn, for every timestep, a set of deformations for each Gaussian. This can be done explicitly, or implicitly using a deformation that is evaluated for each Gaussian. This results in substantially faster training and rendering, with comparable rendering quality. Although these methods produce decent-quality renders, on closer inspection all of them are noticeably blurrier than what is possible with static reconstructions, especially when strong motion is involved.

3 Methodology

Figure 1 summarizes our method, EgoGaussian, for dynamic scene reconstruction from RGB egocentric videos. EgoGaussian first obtains camera poses and hand-object segmentation masks; the segmentation masks are further used to partition the videos into static and dynamic clips (Sec. 3.2). The static clips are used to initialize the static background and object shapes (Sec. 3.3), while the dynamic clips are used to track object motion and gradually refine their shapes (Sec. 3.4).

Figure 1: EgoGaussian Pipeline. Given an egocentric video input, our framework first estimates camera poses via structure-from-motion and obtains hand-object segmentation masks using an off-the-shelf approach. We also partition the video input into static and dynamic clips in the preprocessing step. The static clips are used to reconstruct the background scenes and initialize the shapes of the object that will be interacted with. Subsequently, we refine the object’s shapes and track their motion through the dynamic clips.

3.1 Preliminary: 3D Gaussian Splatting

We use 3D Gaussian Splatting (3D-GS) as our modeling structure because it provides an explicit 3D scene representation with a set of point-cloud-like 3D Gaussians. Each Gaussian is characterized by a position (mean) $\boldsymbol{\mu}$, a covariance matrix $\Sigma$, an opacity, and color features $\mathbf{c}$. The Gaussians are defined using the standard multivariate Gaussian distribution $G(\mathbf{x}) = e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$, providing a flexible optimization framework and a compact 3D scene representation.

3D Gaussian Splatting utilizes a differentiable point-based $\alpha$-blending rendering to compute the color $C$ of pixel $\mathbf{x}_p$. Specifically, it adapts a typical neural point-based approach and blends $N$ ordered points overlapping the pixel: $C(\mathbf{x}_p) = \sum_{i \in \mathcal{N}} \mathbf{c}_i \alpha_i \prod_{j=1}^{i-1}(1 - \alpha_j)$, where $\alpha_i$ is calculated by evaluating a 2D Gaussian with covariance $\Sigma$, projected from the 3D Gaussian, and then multiplied with its opacity; and $\mathbf{c}_i$ is the color of each Gaussian. The original 3D-GS implementation treats color as a directional appearance component represented via spherical harmonics (SH). For simplicity, we disable the view-dependent color by setting the maximum SH degree to $0$.
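To make the blending rule concrete, the following is a minimal per-pixel sketch in plain Python. It is not the tile-based CUDA rasterizer used by 3D-GS; the helper names `gaussian_2d` and `blend_pixel` and the dictionary layout of a splat are hypothetical, and the splats are assumed to be already projected to 2D and sorted from near to far.

```python
import math

def gaussian_2d(x, mean, cov):
    """Unnormalized 2D Gaussian exp(-0.5 * d^T cov^{-1} d), with d = x - mean."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form d^T cov^{-1} d using the closed-form 2x2 inverse.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q)

def blend_pixel(pixel, splats):
    """Front-to-back alpha blending of depth-sorted splats at one pixel."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha_j) for j < i
    for s in splats:
        alpha = s["opacity"] * gaussian_2d(pixel, s["mean"], s["cov"])
        weight = alpha * transmittance
        color = [c + weight * sc for c, sc in zip(color, s["color"])]
        transmittance *= 1.0 - alpha
    # The accumulated alpha equals the same sum without the color term.
    return color, 1.0 - transmittance
```

Replacing the color $\mathbf{c}_i$ with any other per-Gaussian scalar reuses the same loop, which is how a per-Gaussian quantity other than color can be rendered into an image.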

3.2 Data preprocessing

Camera pose estimation. For egocentric datasets whose camera poses are already computed and publicly available (e.g., EPIC Fields [56] provides estimated camera poses for EPIC-KITCHENS [11]), we employ them directly; otherwise, we estimate camera poses ourselves using COLMAP’s structure-from-motion (SfM). We provide a frame filtering pipeline to alleviate the computational load of running COLMAP on lengthy videos. COLMAP’s SfM also creates a sparse point cloud corresponding to the estimated camera poses, which we use as an initialization for 3D Gaussian Splatting.

Hand-object segmentation. We require pixel-level segmentation of hand-object interaction, either provided [35] or estimated beforehand. The onset and offset frames of each hand-object interaction should also be estimated throughout the video. For the HOI4D dataset these are provided, whereas for the EPIC-KITCHENS dataset we use EgoHOS [74] to generate hand masks and the Track-Anything model [66] for object masks and human body masks. Furthermore, these masks are dilated by 2 pixels for better robustness.

Video partitioning. We then partition the egocentric video into static and dynamic clips according to the onset and offset of interactions. We define static clips as ones where only the actor’s hands or body are moving, but objects are all static, while dynamic clips contain both actor and object motion.
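The partitioning step can be sketched as follows, assuming the interactions are given as sorted, non-overlapping (onset, offset) frame index pairs with inclusive bounds. The function name `partition_clips` and this input layout are illustrative assumptions rather than the interface of our implementation.

```python
def partition_clips(num_frames, interactions):
    """Split frame indices [0, num_frames) into labeled clips.

    interactions: list of (onset, offset) frame indices, inclusive,
    assumed sorted and non-overlapping.
    Returns a list of (label, start, end) tuples with inclusive bounds.
    """
    clips = []
    cursor = 0  # first frame not yet assigned to a clip
    for onset, offset in interactions:
        if onset > cursor:
            # Frames before the next interaction contain no object motion.
            clips.append(("static", cursor, onset - 1))
        clips.append(("dynamic", onset, offset))
        cursor = offset + 1
    if cursor < num_frames:
        clips.append(("static", cursor, num_frames - 1))
    return clips
```

For a 100-frame video with interactions at frames 20 to 39 and 60 to 79, this yields three static clips interleaved with two dynamic clips.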

3.3 Static clip reconstruction

As some other works have pointed out, 3D Gaussian Splatting tends to overfit to training views by generating excessive floaters when there are scene inconsistencies among 3D views [22]. In order to eliminate such inconsistencies, any objects that move at all in the scene should be identified and masked out to provide a purely static scene reconstruction. Currently moving objects can be identified using the masks mentioned in the previous section. We explain in the following sections how we also identify objects that have moved or will move, and we provide an illustration of this process in Figure 2.

Figure 2: Static reconstruction pipeline. Given a static clip partitioned from egocentric video, we adopt a standard 3D Gaussian Splatting training schema first. Based on the trained Gaussians, we then train an object identity variable with respect to 5 object masks, so that object Gaussians can be detached from the background Gaussians.

Initial Static training. We first train a static version of the scene with each static clip. To do so, we use a set of $M$ observations/frames from this clip, $S_i = \{\mathbf{I}_{s_j}, \mathbf{H}_{s_j}, \theta_{s_j} \mid j = 1, \dots, M\}$, where $\mathbf{I}_{s_j}$ is an input RGB egocentric frame, $\mathbf{H}_{s_j}$ is the binary hand/body mask in which a pixel value of $0$ marks a body part and a pixel value of $1$ marks the rest of the frame, and $\theta_{s_j}$ denotes the corresponding camera parameters for frame $j$. We then follow an optimization pipeline similar to that of the original 3DGS, including pruning and densification. We use a masked version of their loss function:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_1\left(\mathbf{I}_{\text{input}}, \mathbf{I}_{\text{render}}\right) + \lambda\,\mathcal{L}_{\text{D-SSIM}}\left(\mathbf{I}_{\text{input}}, \mathbf{I}_{\text{render}}\right), \tag{1}$$

with the gradients zeroed out according to the hand/body mask $\mathbf{H}$. Similar to SuGaR [23], after around 30K iterations, we append an additional entropy loss on the opacity $\alpha$ of Gaussians, i.e.

$$\mathcal{L}_{\text{entropy}_{\alpha}} = -\alpha \log(\alpha) - (1 - \alpha) \log(1 - \alpha), \tag{2}$$

as a way to encourage the Gaussians to be either fully transparent or completely opaque, and train for another 10K iterations with pruning and densification disabled. Instead, we prune the transparent Gaussians once at the end of this phase of training.
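As a rough illustration of the two ingredients above, the sketch below implements a masked $\mathcal{L}_1$ term and the opacity entropy of Eq. (2) on flat lists of values. The D-SSIM term is omitted for brevity, normalizing the masked loss over unmasked pixels only is an assumption, and the pruning threshold of 0.5 is a hypothetical value, not one reported in this paper.

```python
import math

def masked_l1(target, rendered, mask):
    """Mean absolute error over pixels where mask == 1 (mask == 0 is hand/body)."""
    kept = [abs(t - r) for t, r, m in zip(target, rendered, mask) if m == 1]
    return sum(kept) / len(kept) if kept else 0.0

def opacity_entropy(alpha, eps=1e-8):
    """Binary entropy of a single opacity value, as in Eq. (2)."""
    a = min(max(alpha, eps), 1.0 - eps)  # clamp so that log() stays finite
    return -a * math.log(a) - (1.0 - a) * math.log(1.0 - a)

def prune_transparent(opacities, threshold=0.5):
    """Indices of Gaussians kept by the one-off pruning step (hypothetical threshold)."""
    return [i for i, a in enumerate(opacities) if a >= threshold]
```

The entropy is largest at $\alpha = 0.5$ and vanishes as $\alpha$ approaches $0$ or $1$, so minimizing it pushes each opacity toward one of the two extremes before the final pruning.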

Object identity training. This produces a set of 3D Gaussians $\mathcal{G}$ reconstructing the scene captured by this static clip, which includes both the static background and any objects that might move during dynamic portions of the video. As previously mentioned, we require masks not only for currently moving objects, but also for objects that move at any time in the video. As these objects are static and not segmented during static clips, we instead use the masks from the closest adjacent dynamic frame, and copy them forwards and backwards in time. For example, for a static clip $S_1$ with $M$ frames between two dynamic clips in which different objects are moved, masks for two different objects A and B should be determined for the first and last 5 frames of this clip respectively: $\{\tilde{\mathbf{O}}_{A,1}, \dots, \tilde{\mathbf{O}}_{A,5}\}$, $\{\tilde{\mathbf{O}}_{B,M-4}, \dots, \tilde{\mathbf{O}}_{B,M}\}$. Although this assumes that the camera moves minimally during these transitions, this works well in practice for all tested settings. An additional trainable label parameter $l$ is then attached to each Gaussian and initialized to a very small value. This label can then be rendered in the same way as the RGB value: $L(\mathbf{x}_p) = \sum_{i \in \mathcal{N}} l_i \alpha_i \prod_{j=1}^{i-1}(1 - \alpha_j)$. This produces a segmentation mask $L_{render}$, upon which we can apply a binary cross entropy loss using the dynamic object masks $\tilde{\mathbf{O}}$ we previously discussed:

$$\mathcal{L}_{\text{BCEWithLogitsLoss}} = -\left[(1 - \tilde{\mathbf{O}}) \cdot \log\left(\sigma(\mathbf{I}_{\text{render, obj}})\right) + \tilde{\mathbf{O}} \cdot \log\left(1 - \sigma(\mathbf{I}_{\text{render, obj}})\right)\right], \tag{3}$$

where $\sigma(X) = \frac{1}{1 + e^{-X}}$ is the sigmoid function. This allows us to separate the Gaussians into object and background Gaussians by thresholding $\sigma(l)$; the object Gaussians can then be rendered to compute movable object masks.
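The label loss and the subsequent split can be sketched as below. The code follows the sign convention of Eq. (3) as printed, in which pixels with $\tilde{\mathbf{O}} = 0$ push the rendered label toward $1$; the function names and the threshold of 0.5 are illustrative assumptions, and the loss is evaluated in a numerically stable softplus form rather than by calling the sigmoid directly.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def label_bce(rendered_logits, masks):
    """Eq. (3) averaged over pixels; note that the target is (1 - O~)."""
    total = 0.0
    for x, o in zip(rendered_logits, masks):
        # -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x)
        total += (1.0 - o) * softplus(-x) + o * softplus(x)
    return total / len(rendered_logits)

def split_gaussians(labels, threshold=0.5):
    """Separate Gaussian indices into (object, background) by sigmoid(l)."""
    obj = [i for i, l in enumerate(labels) if sigmoid(l) > threshold]
    bg = [i for i, l in enumerate(labels) if sigmoid(l) <= threshold]
    return obj, bg
```

Once the per-Gaussian labels are trained, only the indices returned as object Gaussians need to be rendered to obtain a movable object mask for any camera pose.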

Combination of static and dynamic clips. The previously mentioned static scene still includes objects that are only static for some parts of the video. However, parts of the background scene are obscured by these objects, and only become visible during dynamic clips. We therefore retrain the static scene using the movable object masks in order to mask out objects that move during any time in the video. This allows us to train a more complete static scene which includes portions of the background that are only visible during dynamic parts of the video.

3.4 Dynamic object modeling

Object pose estimation. The pre-trained and segmented set of Gaussians $\mathcal{G}_{\text{obj}}$ can be used not only to mask out the objects for the purpose of background reconstruction, but also as an initial estimate of the object appearance. Then, drawing inspiration from previous hand-object reconstruction works that use a differentiable renderer, such as [6, 73], we estimate the object pose for each video frame in the dynamic clip. Specifically, we estimate for each frame $f_j$ a corresponding pose $\mathbf{p}_j$. We further decompose the pose $\mathbf{p}_j$ into a 3D translation vector $\mathbf{t}_j$ and a rotation matrix $\mathbf{R}_j$. Unlike previous dynamic Gaussian Splatting methods [63, 37], our method applies one set of transformation parameters to the whole collection of object Gaussians, treating it as a single rigid object.

To ensure the transformation is rigid, any estimated $3 \times 3$ rotation matrix must satisfy $\mathbf{R}_j \in SO(3)$. We optimize the rotation $\mathbf{R}_j$ using the 6D continuous rotation representation proposed by [77], together with the translation $\mathbf{t}_j$. Hence for each input frame $f_j$, assuming we have the Gaussians representing the object’s state in the previous frame, $\mathcal{G}_{\text{obj},j-1}$, we apply the current trainable translation and rotation parameters to it: $\mathbf{X}_{\mathcal{G}_{\text{obj},j}} = \mathbf{X}_{\mathcal{G}_{\text{obj},j-1}} \cdot g(\tilde{\mathbf{R}}_j) + \mathbf{t}_j$, where $\mathbf{X}_{\mathcal{G}_{\text{obj},j}}$ denotes the 3D coordinates of the Gaussians $\mathcal{G}_{\text{obj},j}$ corresponding to the object’s state at frame $j$, $\tilde{\mathbf{R}}_j$ is the 6D representation of the rotation, and $g(\cdot)$ is a function defined in [77] that transforms the 6D representation of a rotation into a standard $3 \times 3$ rotation matrix.
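A minimal sketch of the mapping $g(\cdot)$ and of the rigid update is given below, using the Gram-Schmidt construction commonly associated with the 6D representation. Placing the resulting orthonormal vectors as matrix columns is one convention (some libraries stack them as rows), and the helper names are hypothetical. Points are treated as row vectors to match the update rule stated in the text.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def rotation_from_6d(r6):
    """Map six unconstrained numbers to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = r6[:3], r6[3:]
    b1 = _normalize(a1)
    d = _dot(b1, a2)
    b2 = _normalize([a - d * b for a, b in zip(a2, b1)])  # remove the b1 component
    b3 = _cross(b1, b2)  # right-handed third axis, so det(R) = +1
    return [[b1[i], b2[i], b3[i]] for i in range(3)]  # b1, b2, b3 as columns

def transform_points(points, R, t):
    """Row-vector convention from the text: x_new = x_old @ R + t."""
    return [[sum(p[k] * R[k][c] for k in range(3)) + t[c] for c in range(3)]
            for p in points]
```

Because any six numbers with non-collinear halves map to a valid element of $SO(3)$, the pose can be optimized with unconstrained gradient steps while the transformation stays rigid.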

In order to prioritize learning the object pose, during this stage the learning rates on the parameters of the Gaussians, such as position and color, are all lowered by a factor of 10. In order to better constrain the shape of the object and prevent it from extending outside of the mask, we also apply a silhouette loss to the rendered alpha value, which can be computed using the following equation: $A(\mathbf{x}_p) = \sum_{i \in \mathcal{N}} \alpha_i \prod_{j=1}^{i-1}(1 - \alpha_j)$, which is effectively the RGB rendering equation without the color term. Our final object loss is then:

$$\mathcal{L}_{\text{obj}} = \mathcal{L}_1\left(\mathbf{I}_{\text{obj}}, \mathbf{I}_{\text{render},\mathcal{G}_{\text{obj}}}\right) + \lambda\,\mathcal{L}_2\left(1 - \mathbf{O}, \mathbf{A}_{\text{render},\mathcal{G}_{\text{obj}}}\right), \tag{4}$$

where $\mathbf{I}_{\text{obj}}$ is cropped from $\mathbf{I}_{\text{input}}$ with the object mask $\mathbf{O}$, $\mathbf{I}_{\text{render},\mathcal{G}_{\text{obj}}}$ is rendered from $\mathcal{G}_{\text{obj}}$ so that it contains only the object on a black background, and $\mathbf{A}_{\text{render},\mathcal{G}_{\text{obj}}}$ is the rendered alpha. We experimentally observe that $0.5$ is a suitable value for $\lambda$.
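Eq. (4) can be sketched on flattened pixel lists as follows. Interpreting $\mathcal{L}_1$ as a mean absolute error and $\mathcal{L}_2$ as a mean squared error is an assumption of this sketch, and the function name `object_loss` is hypothetical.

```python
def object_loss(target_rgb, rendered_rgb, object_mask, rendered_alpha, lam=0.5):
    """Photometric L1 term plus a silhouette term weighted by lam, as in Eq. (4).

    target_rgb / rendered_rgb: flat lists of color values of equal length.
    object_mask / rendered_alpha: flat lists with one entry per pixel.
    """
    l1 = sum(abs(t - r) for t, r in zip(target_rgb, rendered_rgb)) / len(target_rgb)
    # The silhouette target is (1 - O), matching the convention of Eq. (4).
    l2 = sum(((1.0 - o) - a) ** 2
             for o, a in zip(object_mask, rendered_alpha)) / len(object_mask)
    return l1 + lam * l2
```

The silhouette term penalizes rendered alpha wherever it disagrees with the mask, which discourages object Gaussians from drifting outside the object region even when the photometric term alone would tolerate it.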

As we optimize the time-dependent pose parameters $\mathbf{t}_{j}$ and $\tilde{\mathbf{R}}_{j}$ one frame at a time, the Gaussians can easily overfit to the current frame. To address this, we do not train on the current frame alone: at every training iteration, we train on either the current frame or a random previous frame, each with probability $0.5$. A further problem is that, during the Gaussian Splatting optimization process, the opacity is frequently set to zero in order to prune floaters, which would produce a very noisy signal for the pose optimization. We therefore alternate between optimizing the rigid object pose and densifying/pruning the Gaussians. In practice, for every dynamic frame, we first train for 4k iterations optimizing the object pose without pruning or densification. We then freeze the object pose and train for another 4k iterations with densification and pruning, so that the Gaussians better incorporate any new information from the object pose. Finally, we train for another 4k iterations optimizing the pose again, without densification or pruning. Throughout all 12k iterations, all Gaussian parameters such as color and position are continuously optimized. After iterating through the whole dynamic clip with $M$ frames, we obtain a coarse object pose for each frame, $P=\{\mathbf{t}_{j},\mathbf{R}_{j} \mid j=1,\ldots,M\}$. We then perform one final round of training on all frames jointly, with 6k iterations of pose estimation, 6k iterations of pruning/densification, and another 6k iterations of pose estimation. This ensures that our object model fits all frames more equally, rather than favoring the most recently seen ones. We provide an illustration of this process in Figure 3.
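The alternating schedule described above can be sketched as a list of training phases. This is only an illustrative outline, not the authors' implementation: the function names, the dictionary layout, the zero-based frame indices, and the uniform sampling over previous frames are all assumptions made for the example. Only the iteration counts and the 0.5 probability come from the text.

```python
import random


def frame_phases(iters, scope):
    """Pose-only, then densify/prune with the pose frozen, then pose-only again."""
    return [
        {"scope": scope, "iters": iters, "optimize_pose": True, "densify_prune": False},
        {"scope": scope, "iters": iters, "optimize_pose": False, "densify_prune": True},
        {"scope": scope, "iters": iters, "optimize_pose": True, "densify_prune": False},
    ]


def build_schedule(num_frames, per_frame_iters=4000, joint_iters=6000):
    """Hypothetical schedule: three phases per dynamic frame, then a joint round."""
    schedule = []
    for j in range(num_frames):  # zero-based here; the paper indexes j = 1..M
        schedule.extend(frame_phases(per_frame_iters, scope=j))
    # Final round over all frames jointly (6k / 6k / 6k iterations in the text).
    schedule.extend(frame_phases(joint_iters, scope="all"))
    return schedule


def sample_training_frame(current, rng, p_current=0.5):
    """Return the current frame with probability p_current, else a previous one.

    Assumption: previous frames are drawn uniformly; the text only says "random".
    """
    if current == 0 or rng.random() < p_current:
        return current
    return rng.randrange(current)
```

For a clip with M dynamic frames, this schedule amounts to 12k iterations per frame plus 18k joint iterations, which is consistent with the training-time cost noted in the limitations.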

Figure 3: Object pose estimation pipeline. We use a sequential pipeline with regularization from previous frames, allowing us to simultaneously estimate the object poses and reconstruct the object for each transformation we model, while preserving the correct geometry of the object.

Combining the static scene and dynamic object. As a final step, we combine the object model estimated in Section 3.4 with the full background model from Section 3.3. In practice, we note that at this stage there are often floaters belonging to the background that obscure parts of the object. To eliminate these, we perform a final fine-tuning stage using all training frames and Gaussians. As we focus here on optimizing how the background and dynamic object interact and fit with each other, we again freeze the per-frame object pose. This yields the full scene reconstruction, including the per-frame object pose.
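As a rough illustration of how a per-frame rigid pose places the object Gaussians into the background scene, the sketch below applies $\mathbf{x}' = \mathbf{R}\mathbf{x} + \mathbf{t}$ to the Gaussian means and $\Sigma' = \mathbf{R}\Sigma\mathbf{R}^{\top}$ to the covariances, then concatenates both sets. It assumes NumPy arrays and explicit covariance matrices for simplicity; the function names and array layout are hypothetical, and the paper's actual parameterization (for example, how rotations and scales are stored) may differ.

```python
import numpy as np


def transform_object_gaussians(means, covariances, R, t):
    """Rigidly move Gaussians: means (N, 3), covariances (N, 3, 3), R (3, 3), t (3,)."""
    new_means = means @ R.T + t          # x' = R x + t for every mean
    new_covs = R @ covariances @ R.T     # Sigma' = R Sigma R^T, broadcast over N
    return new_means, new_covs


def compose_scene(bg_means, bg_covs, obj_means, obj_covs, R, t):
    """Concatenate static background Gaussians with the posed object Gaussians."""
    moved_means, moved_covs = transform_object_gaussians(obj_means, obj_covs, R, t)
    means = np.concatenate([bg_means, moved_means], axis=0)
    covs = np.concatenate([bg_covs, moved_covs], axis=0)
    return means, covs
```

With one pose per frame, calling `compose_scene` with that frame's `R` and `t` gives the Gaussian set to render for that frame, while the background arrays stay unchanged across frames.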

4 Experiments

We compare our method with existing baselines for the dynamic scene reconstruction task. The goal of this task is to reconstruct both static 3D scenes and dynamic objects from RGB egocentric videos.

4.1 Novel View Synthesis

Datasets. We evaluate our method for dynamic novel view synthesis on in-the-wild videos extracted from two commonly used egocentric video datasets.

HOI4D [35] is a large-scale egocentric video dataset of human-object interactions, where each video is 20 seconds long. We randomly select 4 videos involving active objects undergoing rigid transformations from this dataset. Among these, 2 videos contain mostly translations, while the other 2 include both translations and rotations. Compared to the original dataset, we downsample the image resolution to one-quarter of its original size, resulting in a resolution of 480 × 270 pixels.

EPIC-KITCHENS [11] is a large-scale dataset featuring in-the-wild egocentric videos of human-object interactions in native kitchen environments. We select 4 video clips involving rigid transformations from the EPIC-KITCHENS dataset. Of these clips, 2 contain mostly translations, while the other 2 include both translations and rotations. As with HOI4D, we downsample the image resolution, here to 455 × 256 pixels. The average length of these clips is 10.43 seconds at 60 FPS.

Evaluation protocol. For each video, we train on every second frame, and evaluate on the rest. Although we are able to correctly track the trajectory of the motion with fewer training frames, the fast motion that comes with egocentric videos leads to a significant loss of information when doing so. As such, we use a much denser concentration of training frames.
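The split can be illustrated with a short sketch. It assumes that training frames start at index 0 and are taken at a fixed step, which the text does not state explicitly; the function name and the `step` parameter are hypothetical.

```python
def split_frames(num_frames, step=2):
    """Train on every `step`-th frame (starting at 0) and evaluate on the rest."""
    train = list(range(0, num_frames, step))
    train_set = set(train)
    test = [i for i in range(num_frames) if i not in train_set]
    return train, test
```

With `step=2`, half of the frames are held out for evaluation. A larger step, such as the every-sixth-frame setting used in the ablation of Section 4.3, leaves wider temporal gaps between training frames.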

Metrics. To assess the performance of our model, we use the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM) [60], and the VGG-based perceptual similarity metric LPIPS [75]. As we aim to reconstruct the background and object without the actor, we mask out the arm and body of any actors within the scene when computing these metrics, and only evaluate the quality of the object and background reconstruction.
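To make the masking concrete, the following sketch computes PSNR over only the pixels not covered by an actor mask. It is a minimal example under stated assumptions: images are float arrays in $[0, 1]$ of shape (H, W, 3), the mask is a boolean (H, W) array that is true on actor pixels, and the function name is hypothetical. How the masking is applied to SSIM and LPIPS is not specified in the text and is not covered here.

```python
import numpy as np


def masked_psnr(pred, target, actor_mask, max_val=1.0):
    """PSNR over pixels where actor_mask is False (actor pixels are ignored)."""
    valid = ~actor_mask
    if not valid.any():
        raise ValueError("mask excludes every pixel")
    # Boolean indexing with an (H, W) mask keeps the channel axis: shape (K, 3).
    diff = pred.astype(np.float64)[valid] - target.astype(np.float64)[valid]
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Errors inside the masked region do not influence the score, so a poorly reconstructed hand or arm is not penalized, which matches the stated goal of evaluating only the object and background.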

Baselines. We compare our model’s results with the current state-of-the-art (SOTA) methods, Deformable 3DGS [68] and 4DGS [63]. Both methods have publicly available codebases, which we run as-is to report the results. All numbers reported in the tables are benchmarked on a single NVIDIA RTX 6000 or TITAN Xp GPU.

Results. Table 1 and Figure 5 compare our method with existing dynamic Gaussian Splatting methods. EgoGaussian outperforms both baselines on all evaluation metrics on both datasets, with PSNR gains over the stronger baseline ranging from 0.49 (EPIC-KITCHENS static frames) to 5.56 (HOI4D static frames). The two SOTA methods perform very similarly on both datasets, and all methods achieve results on static evaluation frames that are at least as good as those on dynamic ones.

Table 1: Comparison with SOTA dynamic Gaussian Splatting methods. We evaluate our method and two other SOTA baselines on the HOI4D and EPIC-KITCHENS (EK) datasets. The best and second best results are bolded and italicized respectively.
Method        | HOI4D Static          | HOI4D Dynamic         | EPIC-KITCHENS Static  | EPIC-KITCHENS Dynamic
              | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓
4DGS [63]     | 0.89   25.45  0.13    | 0.89   25.25  0.13    | 0.91   33.25  0.12    | 0.79   22.77  0.24
Def-3DGS [68] | 0.90   25.96  0.11    | 0.89   25.39  0.12    | 0.93   33.73  0.11    | 0.81   23.38  0.22
Ours          | 0.96   31.52  0.07    | 0.95   30.29  0.09    | 0.94   34.22  0.10    | 0.88   28.30  0.17

4.2 Dynamic modeling

Figure 4: More qualitative results of the reconstructed dynamic scenes. The center figure shows the motion path, while the four F panels display individual trajectories of the reconstructed frames compared to the corresponding ground truth input frames.

Figure 4 shows that we can reconstruct the scene, including the object trajectory, either using the original camera motion or from a fixed point of view.

4.3 Ablation study

Without full scene fine-tuning. We show the necessity of jointly fine-tuning the static background and dynamic object, as described in Section 3.4, by comparing against a variant in which the background and object are trained only in isolation, without fine-tuning on all frames or on the combined scene. As can be seen in Table 2, quality drops significantly without full scene fine-tuning: for example, PSNR on static frames falls from 31.52 to 23.90 on HOI4D and from 34.22 to 21.58 on EPIC-KITCHENS.

Table 2: Ablation study of full scene fine-tuning.
Method              | HOI4D Static          | HOI4D Dynamic         | EPIC-KITCHENS Static  | EPIC-KITCHENS Dynamic
                    | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓
With Fine-tuning    | 0.96   31.52  0.07    | 0.95   30.29  0.09    | 0.94   34.22  0.10    | 0.88   28.30  0.17
Without Fine-tuning | 0.87   23.90  0.15    | 0.86   23.03  0.17    | 0.78   21.58  0.24    | 0.79   21.04  0.24

Estimating poses with a larger time gap. We show that our method can also estimate poses across larger time gaps by training on every sixth frame instead of every second frame. As seen in Table 3, although PSNR drops (for example, from 31.82 to 28.61 on HOI4D static frames), we are still able to produce an accurate reconstruction. Note that these metrics are computed only on the dynamic object itself.

Table 3: Ablation study of the training step size, evaluated on the reconstructed object only.
Method                  | HOI4D Static          | HOI4D Dynamic         | EPIC-KITCHENS Static  | EPIC-KITCHENS Dynamic
                        | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓
With Original Step Size | 0.98   31.82  0.03    | 0.98   29.79  0.04    | 0.97   28.87  0.05    | 0.97   31.33  0.05
With Larger Step Size   | 0.98   28.61  0.03    | 0.98   28.35  0.03    | 0.94   24.01  0.06    | 0.92   26.68  0.08

5 Conclusion

In this paper, we introduce EgoGaussian, the first method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We present a novel Gaussian Splatting framework that leverages the dynamic nature of human activities and distinguishes between dynamic interactions and static captures. This approach allows us to reliably reconstruct the static background and track object motion during the dynamic phase while progressively refining the object shape. Our method significantly outperforms previous SOTA baselines evaluated on two in-the-wild egocentric video datasets. Qualitative results demonstrate the high quality of the reconstructed dynamic scenes.

Limitations and discussion. While our method successfully reconstructs dynamic scenes from egocentric videos, several challenges remain. First, as we constrain objects to be rigid, we cannot model elastic or stretchable objects. Furthermore, although we substantially improve on previous methods, notable image artifacts and floaters still remain. Finally, the iterative process of estimating the object pose significantly increases training time.

Potential negative societal impacts. EgoGaussian shares most of the societal impacts of other 3D modeling methods, including potential unsolicited 3D reconstruction of, for example, private property. With EgoGaussian and its focus on rigid inanimate objects, there is a heightened risk of industrial espionage. However, this risk is mitigated by the requirement that individuals must have access to images of the specific place or product to create a 3D model. This limitation reduces the likelihood of such privacy and security breaches.

[Figure 5 image grid. Panel labels: GT, Ours, 4DGS [63], Def-3DGS [68]. Scenes: HOI Scene1, HOI Scene2, HOI Scene3, HOI Scene4, EK Scene1, EK Scene2, EK Scene3 (Fail).]
Figure 5: Qualitative comparison with SOTA. We show reconstructions produced by our method and SOTA baselines (4DGS [63] and Deformable 3DGS [68]) from both HOI4D and EPIC-KITCHENS. Our method produces more accurate reconstructions, while baseline approaches fail to handle dynamic interactions.

References

  • [1] Joao Pedro Araújo, Jiaman Li, Karthik Vetrivel, Rishi Agarwal, Jiajun Wu, Deepak Gopinath, Alexander William Clegg, and Karen Liu. Circle: Capture in rich contextual environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21211–21221, 2023.
  • [2] A. Bansal and M. Zollhoefer. Neural pixel composition for 3d-4d view synthesis from multi-views. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 290–299, Los Alamitos, CA, USA, jun 2023. IEEE Computer Society.
  • [3] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5835–5844, 2021.
  • [4] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. CVPR, 2022.
  • [5] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 130–141, 2023.
  • [6] Zhe Cao, Ilija Radosavovic, Angjoo Kanazawa, and Jitendra Malik. Reconstructing hand-object interactions in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12417–12426, 2021.
  • [7] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision (ECCV), 2022.
  • [8] Zhiqin Chen, Thomas Funkhouser, Peter Hedman, and Andrea Tagliasacchi. Mobilenerf: Exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures. In The Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [9] Tianyi Cheng, Dandan Shan, Ayda Hassen, Richard Higgins, and David Fouhey. Towards a richer 2d understanding of hands at scale. Advances in Neural Information Processing Systems, 36:30453–30465, 2023.
  • [10] Enric Corona, Albert Pumarola, Guillem Alenya, Francesc Moreno-Noguer, and Grégory Rogez. Ganhand: Predicting human grasp affordances in multi-object scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5031–5041, 2020.
  • [11] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.
  • [12] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022.
  • [13] Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. Epic-kitchens visor benchmark: Video segmentations and object relations. Advances in Neural Information Processing Systems, 35:13745–13758, 2022.
  • [14] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.
  • [15] Yilun Du, Yinan Zhang, Hong-Xing Yu, Joshua B. Tenenbaum, and Jiajun Wu. Neural radiance flow for 4d view synthesis and video processing. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 14304–14314, 2020.
  • [16] Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J Black, and Otmar Hilliges. HOLD: Category-agnostic 3d reconstruction of interacting hands and objects from video. 2024.
  • [17] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In SIGGRAPH Asia 2022 Conference Papers, 2022.
  • [18] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
  • [19] Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Dynamic novel-view synthesis: A reality check. In NeurIPS, 2022.
  • [20] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
  • [21] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, and Michael Wray. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives, 2023.
  • [22] Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, and Chris Sweeney. Egolifter: Open-world 3d segmentation for egocentric perception. arXiv preprint arXiv:2403.18118, 2024.
  • [23] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. arXiv preprint arXiv:2311.12775, 2023.
  • [24] Yuan-Chen Guo, Di Kang, Linchao Bao, Yu He, and Song-Hai Zhang. Nerfren: Neural radiance fields with reflections. CoRR, abs/2111.15234, 2021.
  • [25] Vladimir Guzov, Julian Chibane, Riccardo Marin, Yannan He, Torsten Sattler, and Gerard Pons-Moll. Interaction replica: Tracking human-object interaction and scene changes from human motion. arXiv preprint arXiv:2205.02830, 2022.
  • [26] Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J Black. Populating 3d scenes by learning human-scene interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14708–14718, 2021.
  • [27] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. ICCV, 2021.
  • [28] Chun-Hao P. Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J. Black. Capturing and inferring dense full-body human-scene contact. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13274–13285, June 2022.
  • [29] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4), jul 2023.
  • [30] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • [31] Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view radiance field reconstruction of human heads. ACM Trans. Graph., 42(4), jul 2023.
  • [32] Jenny Lin, Xingwen Guo, Jingyu Shao, Chenfanfu Jiang, Yixin Zhu, and Song-Chun Zhu. A virtual reality platform for dynamic human-scene interaction. In SIGGRAPH ASIA 2016 virtual reality meets physical reality: Modelling and simulating virtual humans and environments, pages 1–4. 2016.
  • [33] Shaowei Liu, Hanwen Jiang, Jiarui Xu, Sifei Liu, and Xiaolong Wang. Semi-supervised 3d hand-object poses estimation with interactions in time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14687–14697, 2021.
  • [34] Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • [35] Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, June 2022.
  • [36] Zhicheng Lu, Xiang Guo, Le Hui, Tianrui Chen, Ming Yang, Xiao Tang, Feng Zhu, and Yuchao Dai. 3d geometry-aware deformable gaussian splatting for dynamic view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [37] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In 3DV, 2024.
  • [38] Zhaoyang Lv, Nickolas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, et al. Aria everyday activities dataset. arXiv preprint arXiv:2402.13349, 2024.
  • [39] Li Ma, Xiaoyu Li, Jing Liao, Qi Zhang, Xuan Wang, Jue Wang, and Pedro V. Sander. Deblur-nerf: Neural radiance fields from blurry images. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12851–12860, 2021.
  • [40] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR, 2021.
  • [41] Ben Mildenhall, Peter Hedman, Ricardo Martin-Brualla, Pratul P. Srinivasan, and Jonathan T. Barron. NeRF in the dark: High dynamic range view synthesis from noisy raw images. CVPR, 2022.
  • [42] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  • [43] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022.
  • [44] Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20133–20143, 2023.
  • [45] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. ICCV, 2021.
  • [46] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), dec 2021.
  • [47] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10313–10322, 2020.
  • [48] Wentian Qu, Zhaopeng Cui, Yinda Zhang, Chenyu Meng, Cuixia Ma, Xiaoming Deng, and Hongan Wang. Novel-view synthesis and pose estimation for hand-object interaction from sparse views. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15100–15111, 2023.
  • [49] Xiaofeng Ren and Chunhui Gu. Figure-ground segmentation improves handled object recognition in egocentric video. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3137–3144. IEEE, 2010.
  • [50] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In CVPR, 2023.
  • [51] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. ArXiv, abs/2007.02442, 2020.
  • [52] Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.
  • [53] Kiran Somasundaram, Jing Dong, Huixuan Tang, Julian Straub, Mingfei Yan, Michael Goesele, Jakob Julian Engel, Renzo De Nardi, and Richard Newcombe. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561, 2023.
  • [54] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR, 2022.
  • [55] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+ o: Unified egocentric recognition of 3d hand-object poses and interactions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4511–4520, 2019.
  • [56] Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro Laina, Diane Larlus, Dima Damen, and Andrea Vedaldi. Epic fields: Marrying 3d geometry and video understanding, 2024.
  • [57] Vadim Tschernezki, Diane Larlus, and Andrea Vedaldi. NeuralDiff: Segmenting 3D objects that move in egocentric videos. In Proceedings of the International Conference on 3D Vision (3DV), 2021.
  • [58] Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T. Barron, and Pratul P. Srinivasan. Ref-NeRF: Structured view-dependent appearance for neural radiance fields. CVPR, 2022.
  • [59] Feng Wang, Sinan Tan, Xinghang Li, Zeyue Tian, Yafei Song, and Huaping Liu. Mixed neural voxels for fast multi-view video synthesis. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 19649–19659, 2022.
  • [60] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • [61] Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Muller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects. CVPR, 2023.
  • [62] Yu-Shiang Wong, Changjian Li, Matthias Nießner, and Niloy J Mitra. Rigidfusion: Rgb-d scene reconstruction with rigidly-moving objects. In Computer Graphics Forum, volume 40, pages 511–522. Wiley Online Library, 2021.
  • [63] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Wang Xinggang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
  • [64] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9416–9426, 2020.
  • [65] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5428–5438, 2022.
  • [66] Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything meets videos, 2023.
  • [67] Lixin Yang, Xinyu Zhan, Kailin Li, Wenqiang Xu, Jiefeng Li, and Cewu Lu. Cpf: Learning a contact potential field to model the hand-object interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11097–11106, 2021.
  • [68] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023.
  • [69] Yufei Ye, Abhinav Gupta, and Shubham Tulsiani. What’s in your hands? 3d reconstruction of generic objects in hands. In CVPR, 2022.
  • [70] Yufei Ye, Poorvi Hebbar, Abhinav Gupta, and Shubham Tulsiani. Diffusion-guided reconstruction of everyday hand-object interaction clips. In ICCV, 2023.
  • [71] Hongwei Yi, Chun-Hao P. Huang, Shashank Tripathi, Lea Hering, Justus Thies, and Michael J. Black. MIME: Human-aware 3D scene generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12965–12976, June 2023.
  • [72] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5732–5741, 2021.
  • [73] Jason Y Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa. Perceiving 3d human-object spatial arrangements from a single image in the wild. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pages 34–51. Springer, 2020.
  • [74] Lingzhi Zhang, Shenghao Zhou, Simon Stent, and Jianbo Shi. Fine-grained egocentric hand-object segmentation: Dataset, model, and applications. In European Conference on Computer Vision, pages 127–145. Springer, 2022.
  • [75] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • [76] Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, and Siyu Tang. Compositional human-scene interaction synthesis with semantic control. In European Conference on Computer Vision, pages 311–327. Springer, 2022.
  • [77] Yi Zhou, Connelly Barnes, Lu Jingwan, Yang Jimei, and Li Hao. On the continuity of rotation representations in neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [78] Zhifan Zhu and Dima Damen. Get a grip: Reconstructing hand-object stable grasps in egocentric videos. arXiv preprint arXiv:2312.15719, 2023.

Appendix A Appendix / supplemental material

A.1 NeRF-based method experiments

Datasets and evaluation protocol.

In this section, we present supplementary results obtained on the same datasets as in Section 4.1 but with a different division into training and validation frames. For each video, we train on nine out of every ten frames from the static parts and one out of every six frames from the dynamic parts.

Metrics and Baseline.

We compare our model to two well-known NeRF-based architectures that can model dynamic scenes: NeuralDiff [57] and T-NeRF [47]. We use the official implementation with default parameters for NeuralDiff and the implementation provided by [19] for T-NeRF. For T-NeRF, we use the default settings for "nerfies"-type datasets, and the additional preprocessing is borrowed from [45].

Table 4: Results on dynamic NeRF methods. We evaluate two baselines on the HOI4D and EK datasets.
Method          | HOI4D Static          | HOI4D Dynamic         | EPIC-KITCHENS Static  | EPIC-KITCHENS Dynamic
                | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓ | SSIM ↑ PSNR ↑ LPIPS ↓
NeuralDiff [57] | 0.71   21.15  0.32    | 0.76   23.98  0.26    | 0.89   31.06  0.13    | 0.79   24.85  0.24
T-NeRF [47]     | 0.84   23.00  0.10    | 0.78   19.96  0.14    | 0.80   14.52  0.17    | 0.75   21.33  0.22

[Figure 6 image grid. Panel labels: GT, NeuralDiff [57], T-NeRF [47]. Scenes: HOI Scene 2, HOI Scene 3, HOI Scene 4.]
Figure 6: Reconstructions produced by NeuralDiff [57] and T-NeRF [19] on three different scenes from HOI4D.