License: arXiv.org perpetual non-exclusive license
arXiv:2402.16407v1 [cs.CV] 26 Feb 2024
Category: Research. Author’s preprint version. To appear in the proceedings of the IEEE VR 2024 Conference. It will be updated with a DOI when available.

Teaser figure: Overview of our proposed method. Given sparse input view images, we treat every input view as the reference view and construct its corresponding MPI, where each MPI is parameterized by an individual MLP (see Sec. 3.1 for details). Since a novel view image can be rendered by any MPI and should have the same colors and depths regardless of which MPI renders it, we propose the appearance and depth consistency loss to fully utilize cross-view multiplane consistency (see Sec. 3.2 for details).

CMC: Few-shot Novel View Synthesis via Cross-view Multiplane Consistency

Hanxin Zhu
University of Science and Technology of China
e-mail: [email protected]
   Tianyu He
Microsoft Research Asia
e-mail: [email protected]
   Zhibo Chen
University of Science and Technology of China
e-mail: [email protected]
Abstract

Neural Radiance Field (NeRF) has shown impressive results in novel view synthesis, particularly in Virtual Reality (VR) and Augmented Reality (AR), thanks to its ability to represent scenes continuously. However, when just a few input view images are available, NeRF tends to overfit the given views, so that the estimated depths of pixels share almost the same value. Unlike previous methods that conduct regularization by introducing complex priors or additional supervision, we propose a simple yet effective method that explicitly builds depth-aware consistency across input views to tackle this challenge. Our key insight is that by forcing the same spatial points to be sampled repeatedly in different input views, we are able to strengthen the interactions between views and therefore alleviate the overfitting problem. To achieve this, we build the neural networks on layered representations (i.e., multiplane images), so that each sampling point can be resampled on multiple discrete planes. Furthermore, to regularize the unseen target views, we constrain the rendered colors and depths from different input views to be the same. Although simple, extensive experiments demonstrate that our proposed method achieves better synthesis quality than state-of-the-art methods.

Index Terms: Neural Radiance Fields; Few-shot view synthesis; Multiplane Images; Cross-view consistency

Introduction

As a fundamental task in computer vision and computer graphics, novel view synthesis aims at rendering novel view images from several given posed input view images [4, 9]. Recently, Neural Radiance Field (NeRF) [25] has gained increasing popularity due to its powerful ability in continuous scene representation and its superior performance in novel view synthesis.

Figure 1: Given a few input views (e.g., 3 input views), (a) NeRF tends to overfit to input views and results in a dramatic performance drop, where the estimated depths of pixels share almost the same value. (b) Our key insight is to ensure the same spatial points can be sampled repeatedly in different input views. (c) Our proposed method can achieve smooth depth estimation by introducing cross-view multiplane consistency, resulting in better synthesis quality.

However, the success of NeRF and its variants depends to a large extent on the number of input views [16]. As shown in Fig. 1(a), when just a few input views are given, NeRF tends to overfit the input views, resulting in the estimated depths of pixels sharing almost the same value [16, 59]. In principle, this overfitting problem could be alleviated by incorporating priors of different scenes into the neural network [6, 8, 23, 41, 46, 47, 57]. However, these methods require expensive pre-training, and there is usually a domain gap between the pre-training scenes and the target scene [28].

More recently, remarkable progress has also been made toward alleviating the overfitting problem by introducing external supervision [10, 33, 43], pseudo views [1, 7, 18, 43, 52], or physical priors [16, 17, 28]. For example, Jain et al. [16] introduced semantic consistency between various views to encourage realistic renderings. Niemeyer et al. [28] regularized the geometry and appearance of patches for each unseen view. Although effective, the aforementioned methods either ignore the consistency across multiple views [28, 17] or impose the cross-view consistency solely on the image level [16], thereby limiting the performance.

To tackle this challenge, we make an assumption: with fewer input views, the sampling points on each ray are rarely used to render other views, so the neural networks tend to memorize the colors of each input view instead of learning the underlying geometry [58, 2]. To validate this assumption, we propose Cross-view Multiplane Consistency (CMC), in which we force the sampling points to remain identical when rendering different views, as demonstrated in Fig. 1(b). In this way, the same sampling points contribute to the rendering of different views, resulting in depth-aware consistency across views. More specifically, for each input view, we build an individual layered representation (i.e., Multiplane Images) by regarding the input view as the reference view of the Multiplane Images (MPI) [62]. Therefore, based on the discrete multiplane representation, all sampling points are forced to be distributed on the same fixed planes.

Given the multiplane representation for each input view, we aim at imposing cross-view consistency on the multiplanes during optimization. We realize this in two ways: 1) for the input views whose ground-truth images are available, we optimize each MPI using a reconstruction loss that minimizes the difference between the rendered input view images and the ground-truth input view images; 2) for the unseen views that lack ground-truth images, we leverage the underlying consistency: the colors and depths that are rendered from different input views (i.e., different MPIs) should remain the same. As a result, we achieve cross-view multiplane consistency.

We verify our assumptions and proposals on the common LLFF [24] and Shiny [50] datasets, where the overfitting problem is largely overcome and the quality of the synthesized novel views improves.

The main contributions of this paper can be summarized as follows:

  • We propose to force the sampling points to be the same when rendering different views, which alleviates the overfitting problem of few-shot novel view synthesis.

  • To achieve cross-view multiplane consistency, in addition to the reconstruction loss for input views, we propose to impose appearance and depth consistency on the unseen views.

  • We provide an explanation for the overfitting problem and then give the intuition behind our proposed CMC.

  • Our proposed method achieves state-of-the-art performance on various widely adopted datasets.

1 Related Work

1.1 Novel View Synthesis

As a long-standing problem in computer vision and computer graphics, novel view synthesis has been studied for decades with methods based on image-based rendering [4, 5, 9, 37], light fields [20, 24, 40, 51], point clouds [19, 39, 49, 55] and learning-based representations [11, 12, 32, 63]. Recently, Neural Radiance Field (NeRF) [25] has become increasingly popular due to its remarkable performance in novel view synthesis. Given several 2D input view images of a static scene, NeRF can render photorealistic novel view images through a coordinate-based implicit neural representation. It has been extended to several different tasks, such as dynamic scene representation [29, 30], fast training and rendering [13, 27, 50, 56], stylization [15, 26, 45], and generalizable scene representation [6, 23, 47, 53]. Though NeRF achieves great synthesis quality, it depends on dense input view images, which makes it unsuitable for many practical applications. As a result, in this paper, we focus our attention on view synthesis with sparse input views, i.e., few-shot novel view synthesis.

1.2 Few-shot NeRF

When only a few input view images with large disparities are available, NeRF easily overfits these input views, as shown in Fig. 1(a). Some generalizable neural fields [6, 8, 47, 57] could avoid this problem by using large-scale cross-scene datasets to learn scene priors, but their performance degrades significantly when there is a large domain gap between the test scenes and the training dataset. [10, 33, 43] proposed to overcome the overfitting tendency of the few-shot setting in a per-scene optimization manner with additional supervision signals, such as sparse depth estimated by Structure-from-Motion [34] or pixel correspondence estimated by [42]. To increase the number of training views available, [1, 7, 18, 52] proposed to use depth-warping to generate novel view images as pseudo labels. [16, 17, 28] made use of physical priors to regularize the scene geometry without any additional supervision signals. Recently, FreeNeRF [54] mitigated the overfitting problem from the perspective of frequency, where a novel frequency annealing strategy on positional encoding was proposed. SimpleNeRF [38] instead leveraged augmented models for better and more stable few-shot view synthesis. MixNeRF [36] modeled rays as mixtures of Laplacians, followed by FlipNeRF [35], which used flipped reflection rays as additional training sources.

Though these methods achieve promising results, they either rely heavily on pre-trained neural networks that are usually expensive [33], or only take advantage of physical priors as regularization terms on seen/unseen views independently, without cross-view interactions [17]. Instead, in this paper, we propose to make full use of cross-view consistency for few-shot novel view synthesis.

1.3 Multiplane Images

MPI was first proposed by [62] to expand the small baselines of stereo images. Then [24] extended MPI to view synthesis by constructing local MPIs and blending different MPIs to render novel views. To generate MPIs quickly, [11] proposed DeepView, which leverages learned gradient descent. To model the time-dependent effects of scenes shot at different times, DeepMPI was introduced by [22] in an unsupervised manner. [14, 21, 44] further proposed to use MPI for single-view synthesis. More recently, [50] was proposed to model view-dependent effects and to achieve real-time rendering, and [61] took advantage of MPIs to make a 2D GAN 3D-aware. In this paper, we first apply MPI to few-shot view synthesis, where every input view is treated as the reference view in turn. To enhance the interactions across different views, we propose two new loss functions, i.e., the appearance and depth consistency losses, based on the fact that the colors and depths of the target view rendered by different MPIs should be the same.

2 Preliminaries

Our method is built upon Neural Radiance Field (NeRF) [25] and Multiplane Images (MPI) [62]. We elaborate on them in this section.

2.1 Neural Radiance Field

NeRF [25] has emerged as a powerful tool for continuous scene representation by encoding scene properties into a neural network $F_{\theta}$, which is usually parameterized by a Multilayer Perceptron (MLP). Given the 3D coordinate $\mathbf{x}=(x,y,z)$ of a spatial point and a viewing direction $\mathbf{d}=(d_{x},d_{y},d_{z})$, NeRF outputs the corresponding color $\mathbf{c}$ and volume density $\sigma$, which is denoted as:

$$\mathbf{c},\sigma=F_{\theta}(\gamma(\mathbf{x}),\gamma(\mathbf{d})), \tag{1}$$

where $\gamma$ is the positional encoding operation [25], which aims to recover high-frequency details and textures.

Given several input view images and their camera parameters, a pixel can be rendered by casting a ray $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$ from the camera origin $\mathbf{o}$ towards the pixel along direction $\mathbf{d}$. Specifically, assuming $t\in[t_{n},t_{f}]$, the estimated color $\mathbf{C}(\mathbf{r})$ of this pixel is formulated as follows:

$$\mathbf{C}(\mathbf{r})=\int_{t_{n}}^{t_{f}}T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt, \tag{2}$$

where $T(t)=\exp\left(-\int_{t_{n}}^{t}\sigma(\mathbf{r}(s))\,ds\right)$, and $\sigma$ and $\mathbf{c}$ are obtained by Eq. 1. NeRF is optimized by minimizing the following loss function:

$$\mathcal{L}_{\text{MSE}}=\frac{1}{|\mathcal{R}|}\sum_{\mathbf{r}\in\mathcal{R}}\|\mathbf{C}(\mathbf{r})-\mathbf{C}_{gt}\|_{2}^{2}, \tag{3}$$

where $\mathcal{R}$ is a set of sampling rays, $\mathbf{C}(\mathbf{r})$ is obtained by Eq. 2, and $\mathbf{C}_{gt}$ represents the ground-truth color.
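As an illustration of Eqs. 2 and 3, the following NumPy sketch approximates the rendering integral with the piecewise-constant quadrature that is commonly used for NeRF. The function names `render_ray` and `mse_loss`, the bin-edge parameterization of `t`, and the array shapes are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Approximate Eq. 2 along one ray with piecewise-constant quadrature.

    sigma: (S,) densities, one per segment (hypothetical layout)
    color: (S, 3) colors, one per segment
    t:     (S + 1,) segment edges covering [t_n, t_f]
    """
    delta = np.diff(t)                           # segment lengths
    alpha = 1.0 - np.exp(-sigma * delta)         # opacity of each segment
    # Transmittance before segment k: exp(-sum_{j<k} sigma_j * delta_j)
    accum = np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]])
    trans = np.exp(-accum)
    weights = trans * alpha
    return weights @ color                       # (3,) rendered color

def mse_loss(pred, gt):
    """Eq. 3: squared L2 color error averaged over the set of rays."""
    return np.mean(np.sum((pred - gt) ** 2, axis=-1))
```

A nearly opaque segment placed behind an empty one dominates the rendered color, because the transmittance reaching any later segment is close to zero.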

2.2 Multiplane Images

As a layered scene representation, an MPI [62] is constructed from a set of fronto-parallel planes with respect to a reference view, where all planes are fixed at specific depths that are distributed equally in the depth space. Considering one MPI with $D$ planes $(\mathbf{c}_{i},\alpha_{i})_{i=1}^{D}$, the $i$-th plane at depth $z_{i}$ can be viewed as a 4-channel RGBA image that contains the color $\mathbf{c}_{i}$ and visibility $\alpha_{i}$.

To render a target view based on the MPI of the reference view, each plane of the MPI is warped to the target view, giving $(\mathbf{c}_{i}^{\prime},\alpha_{i}^{\prime})_{i=1}^{D}$, using inverse homography warping, followed by an alpha-composition operation [14, 21, 44, 62]. The rendered image $\mathbf{I}_{t}$ and depth map $Z_{t}$ of the target view are denoted as follows:

$$\mathbf{I}_{t}=\sum_{i=1}^{D}\left(\mathbf{c}_{i}^{\prime}\alpha_{i}^{\prime}\prod_{j=1}^{i-1}\left(1-\alpha_{j}^{\prime}\right)\right) \tag{4}$$

$$Z_{t}=\sum_{i=1}^{D}\left(z_{i}\alpha_{i}^{\prime}\prod_{j=1}^{i-1}\left(1-\alpha_{j}^{\prime}\right)\right). \tag{5}$$

We build our model on MPI. Therefore, the sampling point can be resampled on multiple discrete planes.
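The alpha composition of Eqs. 4 and 5 can be sketched in a few lines of NumPy. The function name `composite_mpi` and the array layout are hypothetical, and the sketch assumes that the planes are already warped to the target view and ordered from nearest ($i=1$) to farthest, which is what the product over $j<i$ suggests.

```python
import numpy as np

def composite_mpi(colors, alphas, depths):
    """Alpha-composite warped MPI planes into an image and a depth map.

    colors: (D, H, W, 3) warped plane colors c'_i
    alphas: (D, H, W)    warped plane visibilities alpha'_i
    depths: (D,)         plane depths z_i, assumed ordered front to back
    """
    # Exclusive cumulative product: prod_{j < i} (1 - alpha'_j)
    ones = np.ones_like(alphas[:1])
    trans = np.cumprod(np.concatenate([ones, 1.0 - alphas[:-1]], axis=0), axis=0)
    weights = alphas * trans                                   # (D, H, W)
    image = np.sum(weights[..., None] * colors, axis=0)        # Eq. 4
    depth = np.sum(weights * depths[:, None, None], axis=0)    # Eq. 5
    return image, depth
```

Both outputs share the same per-plane weights, so the rendered depth is the weighted average of the plane depths under exactly the blending that produces the color.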

3 Cross-view Multiplane Consistency

Motivation.

As shown in Fig. 1(a), NeRF suffers from significant performance degradation when the number of input views is reduced, which also leads to the estimated depths of pixels sharing almost the same value [16, 59]. To tackle this problem, we assume that one plausible reason is that, with fewer input views, the sampling points on each ray are rarely used to render other views. Therefore, it is easier for the neural networks to memorize each input view image [58, 2] than to learn the underlying geometry. Motivated by this, our key insight is to explicitly build depth-aware consistency across different views.

Method Overview.

As shown in the teaser figure, to ensure that the sampling points are the same when rendering different views, we build an individual layered representation (i.e., Multiplane Images) $F_{\theta}^{i}$ for each input view $i$ by utilizing the input view $i$ as the reference view of the Multiplane Images (MPI) [62]. Therefore, all sampling points are distributed on the same fixed planes. Inspired by previous works [50, 21], each MPI is represented by a multilayer perceptron (MLP) $F_{\theta}^{i}$, which outputs the color and visibility for each plane.

To optimize $F_{\theta}^{i}$, for the input views, we directly minimize the difference between the rendered images and the ground-truth ones through a reconstruction loss. Since there is no ground-truth image for the unseen views, we rely on the intuition that the colors and depths rendered by different input views should be the same. Specifically, we minimize the difference in the estimated colors and depths that are obtained by different MPIs.
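To make the idea of penalizing disagreement between MPIs concrete, the sketch below averages squared color and depth differences over all pairs of MPIs that render the same unseen-view pixels. This is only an illustrative assumption: the exact form of the appearance and depth consistency losses is defined in Sec. 3.2, and the function name `cross_view_consistency`, the squared-error distance, and the pairwise averaging are choices made for this example.

```python
import numpy as np
from itertools import combinations

def cross_view_consistency(colors, depths):
    """Pairwise disagreement between renderings of one unseen view.

    colors: (N, P, 3) colors of P pixels rendered by each of N >= 2 MPIs
    depths: (N, P)    depths of the same pixels rendered by each MPI
    Returns (appearance_term, depth_term); both are zero when all MPIs agree.
    """
    pairs = list(combinations(range(len(colors)), 2))
    # Assumed distance: mean squared difference, averaged over MPI pairs
    appearance = np.mean([np.mean((colors[a] - colors[b]) ** 2) for a, b in pairs])
    depth = np.mean([np.mean((depths[a] - depths[b]) ** 2) for a, b in pairs])
    return appearance, depth
```

In a training loop, terms of this kind would be added to the reconstruction loss on the input views with some weighting, which is likewise left unspecified here.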

3.1 Multiplane Representation for Input Views

As shown in the teaser figure, given several sparse input view images $\{\mathbf{I}_{in}^{i}\}_{i=0}^{N-1}\in\mathbb{R}^{H\times W\times 3}$ and their corresponding camera extrinsics $[\mathbf{R}_{in}^{i}|\mathbf{t}_{in}^{i}]_{i=0}^{N-1}\in SE(3)$ of a static scene, our goal is to render photorealistic novel view images, where $H$ and $W$ are the image height and width, $N$ is the number of input views available, and $\mathbf{R}\in\mathbb{R}^{3\times 3}$ and $\mathbf{t}\in\mathbb{R}^{3\times 1}$ represent the rotation matrix and translation vector.

As described in the motivation, in this paper we use MPIs to represent the scene. Different from most MPI-based methods that randomly choose one input view as the reference view and the remaining input views as the target views [14, 21, 44, 50, 62], we propose to treat every input view as the reference view in turn and construct their corresponding MPIs $\{\mathbf{M}_{i}\}_{i=0}^{N-1}$ (i.e., per-view MPI), for the purpose of building depth-aware consistency across different input views. We adopt MLPs to represent the MPIs following previous works [50, 21].

Specifically, consider the camera parameters $[\mathbf{R}_{t}|\mathbf{t}_{t}]$ of one target view and the $i$-th MPI $\mathbf{M}_{i}$, which corresponds to $\mathbf{I}_{in}^{i}$ and has $D$ planes. When a ray $\mathbf{r}$ is cast from the camera origin $\mathbf{o}$ of the target view through the pixel at coordinate $(u_{t},v_{t})$ on its image plane along direction $\mathbf{d}_{i}^{(u_{t},v_{t})}$, it has $D$ intersections with the $D$ planes of $\mathbf{M}_{i}$, which are denoted as $\{\mathbf{x}_{i}^{k}=(u_{i}^{k},v_{i}^{k},z_{i}^{k})\}_{k=0}^{D-1}$, where $(u_{i}^{k},v_{i}^{k})$ is the pixel coordinate of the $k$-th intersection and $z_{i}^{k}$ represents the depth at which plane $k$ is placed. The pixel coordinate of each intersection can be computed by the inverse homography warping operation [14, 21, 44, 62, 50], which is formulated as follows:

$$\begin{bmatrix}u_{i}^{k}\\ v_{i}^{k}\\ 1\end{bmatrix}\sim\mathbf{K}_{in}^{i}\left(\mathbf{R}^{\prime}-\frac{\mathbf{t}^{\prime}\mathbf{n}^{\top}}{z_{i}^{k}}\right)\mathbf{K}_{t}^{-1}\begin{bmatrix}u_{t}\\ v_{t}\\ 1\end{bmatrix}, \tag{6}$$

where $\mathbf{K}_{in}^{i}\in\mathbb{R}^{3\times 3}$ and $\mathbf{K}_{t}\in\mathbb{R}^{3\times 3}$ are the camera intrinsics for the input view $\mathbf{I}_{in}^{i}$ and the target view respectively, $\mathbf{n}=[0,0,1]^{\top}$ is the normal vector of the $k$-th plane, and $\mathbf{R}^{\prime}$ and $\mathbf{t}^{\prime}$ are the relative camera extrinsics from the target view to the input view, which are computed as follows:

$$\begin{bmatrix}\mathbf{R}^{\prime}_{3\times 3}&\mathbf{t}^{\prime}_{3\times 1}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix}=\begin{bmatrix}\mathbf{R}_{t}&\mathbf{t}_{t}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix}^{-1}\begin{bmatrix}\mathbf{R}_{in}^{i}&\mathbf{t}_{in}^{i}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix}. \tag{7}$$
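Eqs. 6 and 7 can be transcribed directly into NumPy, as in the sketch below. The function names `relative_pose` and `warp_pixel` are hypothetical, and the code follows the formulas exactly as written; whether the extrinsics are camera-to-world or world-to-camera is a convention that this section does not state, so the sketch makes no claim about it.

```python
import numpy as np

def relative_pose(R_t, t_t, R_in, t_in):
    """Eq. 7: [R' t'; 0 1] = [R_t t_t; 0 1]^{-1} [R_in t_in; 0 1]."""
    def to_homogeneous(R, t):
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = np.ravel(t)
        return T
    rel = np.linalg.inv(to_homogeneous(R_t, t_t)) @ to_homogeneous(R_in, t_in)
    return rel[:3, :3], rel[:3, 3:4]            # R' is (3, 3), t' is (3, 1)

def warp_pixel(u_t, v_t, z, K_in, K_t, R_rel, t_rel):
    """Eq. 6: map a target pixel to the reference-view plane at depth z."""
    n_T = np.array([[0.0, 0.0, 1.0]])           # n^T for a fronto-parallel plane
    H = K_in @ (R_rel - (t_rel @ n_T) / z) @ np.linalg.inv(K_t)
    p = H @ np.array([u_t, v_t, 1.0])
    return p[0] / p[2], p[1] / p[2]             # dehomogenize ("~" in Eq. 6)
```

When the target and input views share the same pose and intrinsics, the homography reduces to the identity and each pixel maps to itself, which is a convenient sanity check.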

With the computed coordinate 𝐱iksuperscriptsubscript𝐱𝑖𝑘\mathbf{x}_{i}^{k}bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT of each intersection along the ray 𝐫𝐫\mathbf{r}bold_r whose direction is 𝐝i(ut,vt)superscriptsubscript𝐝𝑖subscript𝑢𝑡subscript𝑣𝑡\mathbf{d}_{i}^{(u_{t},v_{t})}bold_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT, both 𝐱iksuperscriptsubscript𝐱𝑖𝑘\mathbf{x}_{i}^{k}bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT and 𝐝i(ut,vt)superscriptsubscript𝐝𝑖subscript𝑢𝑡subscript𝑣𝑡\mathbf{d}_{i}^{(u_{t},v_{t})}bold_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT are fed into the MLP Fθisuperscriptsubscript𝐹𝜃𝑖F_{\theta}^{i}italic_F start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT to estimate its color 𝐜iksuperscriptsubscript𝐜𝑖𝑘\mathbf{c}_{i}^{k}bold_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT and visibility αiksuperscriptsubscript𝛼𝑖𝑘\alpha_{i}^{k}italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT as shown in Fig. CMC: Few-shot Novel View Synthesis via Cross-view Multiplane Consistency, which is denoted as:

$$
\mathbf{c}_i^k, \alpha_i^k = F_\theta^i(\gamma(\mathbf{x}_i^k), \gamma(\mathbf{d}_i^{(u_t,v_t)})), \tag{8}
$$

where $\gamma$ is the positional encoding operation [25], which is formulated as follows:

$$
\gamma(\mathbf{x}) = (\sin(2^{0}\mathbf{x}), \cos(2^{0}\mathbf{x}), \cdots, \sin(2^{L-1}\mathbf{x}), \cos(2^{L-1}\mathbf{x})), \tag{9}
$$

and $L$ is a hand-crafted hyperparameter. The color $\mathbf{C}_i(\mathbf{r})$ and depth $Z_i(\mathbf{r})$ of the pixel $(u_t, v_t)$ in the target view can then be rendered by the $i$-th MPI via volume rendering.
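The positional encoding of Eq. 9 can be sketched in a few lines of NumPy. This is an illustrative version under the assumption that the encoding is applied element-wise to the last axis of the input; the function name `positional_encoding` is hypothetical.

```python
import numpy as np

def positional_encoding(x, L):
    """Eq. 9: map x of shape (..., D) to
    (sin(2^0 x), cos(2^0 x), ..., sin(2^(L-1) x), cos(2^(L-1) x)) of shape (..., 2*L*D)."""
    x = np.asarray(x, dtype=np.float64)
    freqs = 2.0 ** np.arange(L)                      # 2^0, ..., 2^(L-1)
    angles = x[..., None, :] * freqs[:, None]        # shape (..., L, D)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-2)  # shape (..., L, 2, D)
    return enc.reshape(*x.shape[:-1], -1)
```

With a 3D coordinate and `L = 10`, the encoded vector has 60 entries, ordered frequency by frequency with the sine terms preceding the cosine terms.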

3.2 Cross-view Consistency on Multiplanes

Reconstruction Loss for Input Views.

Given the rendered color $\mathbf{C}_i(\mathbf{r})$, if the target view is one of the input views, we adopt the reconstruction loss (Eq. 3), which minimizes the difference between $\mathbf{C}_i(\mathbf{r})$ and the ground-truth color $\mathbf{C}_{gt}$ and is denoted as follows:

$$
\mathcal{L}_{\text{MSE}} = \frac{1}{|N|}\frac{1}{|\mathcal{R}|}\sum_{i=0}^{N-1}\sum_{\mathbf{r}\in\mathcal{R}} \|\mathbf{C}_i(\mathbf{r}) - \mathbf{C}_{gt}\|_2^2, \tag{10}
$$

where $\mathcal{R}$ is a set of sampling rays. In Sec. 4.3, we verify that with our multiplane representation, the reconstruction loss alone can overcome the overfitting problem well.

Appearance and Depth Consistency Loss for Unseen Views.

The above reconstruction loss utilizes consistency across known input views by forcing the spatial points to be sampled on the same planes. To obtain depth-aware consistency across views, we propose the appearance and depth consistency loss across unseen novel views.

Specifically, when the target view is a novel view that has no ground-truth color, it can still be rendered by any MPI, and all of these renderings should share the same color and depth map, as shown in the teaser figure. Based on this observation, we propose the following loss functions:

$$
\mathcal{L}_{\mathrm{ac}} = \frac{2}{|N(N-1)|}\frac{1}{|\mathcal{R}|}\sum_{i=0}^{N-1}\sum_{j=i+1}^{N-1}\sum_{\mathbf{r}\in\mathcal{R}} \|\mathbf{C}_i(\mathbf{r}) - \mathbf{C}_j(\mathbf{r})\|_2^2, \tag{11}
$$
$$
\mathcal{L}_{\mathrm{dc}} = \frac{2}{|N(N-1)|}\frac{1}{|\mathcal{R}|}\sum_{i=0}^{N-1}\sum_{j=i+1}^{N-1}\sum_{\mathbf{r}\in\mathcal{R}} \|Z_i(\mathbf{r}) - Z_j(\mathbf{r})\|_2^2, \tag{12}
$$

where $\mathbf{C}_i$/$Z_i$ and $\mathbf{C}_j$/$Z_j$ represent the colors and depths rendered by the $i$-th MPI and the $j$-th MPI, respectively.

As a result, the whole loss function of our proposed method can be expressed as follows:

$$
\mathcal{L} = \mathcal{L}_{\text{MSE}} + \lambda_{ac}\mathcal{L}_{\mathrm{ac}} + \lambda_{dc}\mathcal{L}_{\mathrm{dc}}, \tag{13}
$$

where $\lambda_{ac}$ and $\lambda_{dc}$ are hyperparameters that balance the weights of $\mathcal{L}_{\mathrm{ac}}$ and $\mathcal{L}_{\mathrm{dc}}$.
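The losses in Eqs. 10 to 13 can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: it assumes that the renderings of all $N$ MPIs are already stacked into arrays of shape `(N, R, C)` (with `R` rays and `C = 3` for colors or `C = 1` for depths), that $N \geq 2$, and that all function and argument names are hypothetical.

```python
import numpy as np
from itertools import combinations

def mse_loss(colors, gt):
    """Eq. 10: colors has shape (N, R, 3), one rendering per MPI; gt has shape (R, 3)."""
    return np.mean(np.sum((colors - gt[None]) ** 2, axis=-1))

def pairwise_consistency(values):
    """Eqs. 11 and 12: squared difference averaged over rays and over all MPI pairs i < j."""
    pairs = list(combinations(range(values.shape[0]), 2))  # N(N-1)/2 pairs
    total = sum(np.mean(np.sum((values[i] - values[j]) ** 2, axis=-1)) for i, j in pairs)
    return total / len(pairs)

def total_loss(input_colors, gt, novel_colors, novel_depths, lam_ac=1.0, lam_dc=1.0):
    """Eq. 13: reconstruction on an input view plus consistency on a novel view."""
    return (mse_loss(input_colors, gt)
            + lam_ac * pairwise_consistency(novel_colors)
            + lam_dc * pairwise_consistency(novel_depths))
```

Note that the reconstruction term uses rays from an input view with known ground truth, while the two consistency terms use rays from a novel view for which only the agreement between MPIs is measured.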

Table 1: Quantitative comparisons on 8 scenes of the Shiny dataset.
| Scene | NeRF [25] PSNR↑ | NeRF [25] SSIM↑ | NeRF [25] LPIPS↓ | DietNeRF [16] PSNR↑ | DietNeRF [16] SSIM↑ | DietNeRF [16] LPIPS↓ | InfoNeRF [17] PSNR↑ | InfoNeRF [17] SSIM↑ | InfoNeRF [17] LPIPS↓ | Ours PSNR↑ | Ours SSIM↑ | Ours LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cake | 15.98 | 0.514 | 0.576 | 18.04 | 0.556 | 0.543 | 14.71 | 0.469 | 0.653 | 17.08 | 0.564 | 0.496 |
| Crest | 11.50 | 0.152 | 0.729 | 9.74 | 0.105 | 0.733 | 12.28 | 0.181 | 0.736 | 14.54 | 0.268 | 0.564 |
| Food | 12.65 | 0.296 | 0.657 | 10.30 | 0.190 | 0.736 | 13.25 | 0.328 | 0.679 | 16.00 | 0.425 | 0.502 |
| Giants | 12.39 | 0.218 | 0.730 | 12.54 | 0.216 | 0.733 | 6.32 | 0.010 | 0.776 | 13.42 | 0.299 | 0.651 |
| Pasta | 13.95 | 0.370 | 0.550 | 13.96 | 0.373 | 0.545 | 13.84 | 0.353 | 0.632 | 14.89 | 0.389 | 0.523 |
| Room | 21.19 | 0.710 | 0.454 | 20.01 | 0.669 | 0.483 | 18.99 | 0.578 | 0.638 | 22.59 | 0.750 | 0.378 |
| Seasoning | 12.27 | 0.358 | 0.684 | 12.05 | 0.347 | 0.682 | 12.62 | 0.384 | 0.684 | 13.05 | 0.447 | 0.605 |
| Tools | 15.04 | 0.580 | 0.500 | 8.35 | 0.276 | 0.717 | 10.89 | 0.358 | 0.65 | 16.23 | 0.598 | 0.411 |
| Average | 14.37 | 0.399 | 0.610 | 13.12 | 0.341 | 0.646 | 12.86 | 0.332 | 0.681 | 15.98 | 0.468 | 0.516 |

3.3 Weighted Rendering

To render a target view from multiple MPIs, we adopt a weighted rendering strategy, based on the assumption that the MPI closest to the target view should have a greater impact on its rendering. Specifically, the final output $\mathbf{C}(\mathbf{r})$ is obtained as a weighted average of the colors rendered by the different MPIs, which is denoted as follows:

$$
\mathbf{C}(\mathbf{r}) = \sum_{i=0}^{N-1} w_i \cdot \mathbf{C}_i(\mathbf{r}), \tag{14}
$$

where $\mathbf{C}_i(\mathbf{r})$ is the color rendered by the $i$-th MPI and $w_i$ is the weight calculated according to the distance $\mu_i$ from the $i$-th MPI to the target view, which is formulated as follows:

$$
w_i = \frac{\mu_i}{\sum_{j=0}^{N-1}\mu_j}, \quad \mu_i = \|\mathbf{o}_t - \mathbf{o}_i\|_2^2, \tag{15}
$$

where $\mathbf{o}_t$ and $\mathbf{o}_i$ represent the camera origins of the target view and the $i$-th MPI, respectively.
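The weighted rendering of Eqs. 14 and 15 can be sketched as below. `distance_weights` follows Eq. 15 literally, so a larger squared distance yields a larger weight. Since the stated assumption is that the closest MPI should have the greater impact, the sketch also includes `inverse_distance_weights`, which is purely a hypothetical alternative for comparison and is not taken from the text. All names are illustrative.

```python
import numpy as np

def distance_weights(o_t, origins):
    """Eq. 15 as written: w_i = mu_i / sum_j mu_j with mu_i = ||o_t - o_i||^2."""
    mu = np.sum((np.asarray(origins) - np.asarray(o_t)) ** 2, axis=-1)
    return mu / np.sum(mu)

def inverse_distance_weights(o_t, origins, eps=1e-8):
    """Hypothetical variant (not from the paper): closer MPIs receive larger weights."""
    mu = np.sum((np.asarray(origins) - np.asarray(o_t)) ** 2, axis=-1)
    inv = 1.0 / (mu + eps)
    return inv / np.sum(inv)

def blend(colors, weights):
    """Eq. 14: colors has shape (N, R, 3), weights has shape (N,); returns (R, 3)."""
    return np.tensordot(weights, colors, axes=1)
```

In both cases the weights sum to one, so the blended color stays within the range of the per-MPI colors.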

3.4 Analysis on Cross-view Multiplane Consistency

To demonstrate the effectiveness of our proposed method, we analyze CMC in this section. We begin with an assumption about the overfitting problem of NeRF under the few-shot setting. Specifically, given sparse input views, as shown in Fig. 1(b), rays of different views rarely share the same sampling points because of the random uniform sampling strategy of NeRF [25], which is denoted as follows:

$$
t_i \sim \mathcal{U}\left[t_n + \frac{i-1}{M}\left(t_f - t_n\right),\; t_n + \frac{i}{M}\left(t_f - t_n\right)\right], \tag{16}
$$

where $t_n$ and $t_f$ are the near and far bounds, $M$ is the number of sampling points along the ray, and $t_i$ is the $i$-th sampling point. Our assumption is therefore that the sampling points on each ray only take part in rendering the pixel corresponding to that ray and are rarely used to render other views. Thus, the optimization process of NeRF [25] (Eq. 2) can be viewed as solving the following equation for each ray independently:

$$
\mathbf{C}_{gt} = \sum_{i=1}^{M} T(\sigma_i) f(\sigma_i) \mathbf{c}_i, \tag{17}
$$

where $T$ and $f$ are both functions of $\sigma_i$, $\mathbf{C}_{gt}$ is the ground-truth pixel color of ray $\mathbf{r}$, and $\sigma_i$ and $\mathbf{c}_i$ are the unknowns to be estimated. We thus have $2M$ unknowns but only one equation, which means that infinitely many solutions exist for this problem. Considering the memorization nature of neural networks [58, 2] and Occam's Razor [3, 31], NeRF tends to converge to the simplest way to represent the known input views, so solving Eq. 17 is assumed to amount to the following sparse optimization problem:

$$
\begin{aligned}
\mathbf{c}_i^{*}, \sigma_i^{*} = \mathop{\arg\min}_{\mathbf{c}_i, \sigma_i} \Big\{ & \Big\| \sum_{i=1}^{M} T(\sigma_i) f(\sigma_i) \mathbf{c}_i - \mathbf{C}_{gt} \Big\|_2^2 \\
& + \sum_{i=1}^{M} \|\mathbf{c}_i\|_0 + \sum_{i=1}^{M} \|\sigma_i\|_0 \Big\},
\end{aligned} \tag{18}
$$

whose solution is

$$
\{\mathbf{c}_i^{*}, \sigma_i^{*}\} = \begin{cases} \{\mathbf{C}_{gt}, 1\}, & i = 0 \\ \{\mathbf{0}, 0\}, & i \geq 1 \end{cases}, \tag{19}
$$

which thus leads to the overfitting problem.
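To make the sampling argument concrete, the stratified rule of Eq. 16 can be sketched as follows. This is an illustrative NumPy version with hypothetical names: each of the $M$ bins between the near and far bounds contributes one uniformly drawn depth, so two rays sampled independently almost never receive identical depths, whereas a fixed set of plane depths is shared by every ray.

```python
import numpy as np

def stratified_samples(t_n, t_f, M, rng):
    """Eq. 16: draw one depth uniformly from each of M equal bins in [t_n, t_f]."""
    edges = np.linspace(t_n, t_f, M + 1)         # M + 1 bin edges
    return rng.uniform(edges[:-1], edges[1:])    # one sample per bin, shape (M,)

rng = np.random.default_rng(0)
ray_a = stratified_samples(2.0, 6.0, 8, rng)     # depths for one ray
ray_b = stratified_samples(2.0, 6.0, 8, rng)     # independent draw for another ray
```

Here `ray_a` and `ray_b` fall into the same bins but differ in their exact values, which illustrates why points sampled on one ray rarely contribute to the rendering of another.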

Based on the analysis above, a direct way to overcome this problem is to force the same point to be sampled along rays of different views. Take two input views $I_0$ and $I_1$ as an example, whose camera origins are $\mathbf{o}_0$ and $\mathbf{o}_1$, respectively. For one sampling point $\mathbf{x}_0 = \mathbf{o}_0 + t_0\mathbf{d}_0$ along ray $\mathbf{d}_0$ of $I_0$, our goal is that $\mathbf{x}_0$ can also be sampled in $I_1$ along ray $\mathbf{d}_1$, thus guaranteeing that the same sampling point takes part in the rendering of pixels in different views. Denoting the sampling point on ray $\mathbf{d}_1$ as $\mathbf{x}_1 = \mathbf{o}_1 + t_1\mathbf{d}_1$, the problem can be converted into the following formulation (i.e., find the optimal $t_1$ that minimizes the distance between $\mathbf{x}_0$ and $\mathbf{x}_1$):

$$
t_1^{*} = \mathop{\arg\min}_{t_1} \|\mathbf{x}_1 - \mathbf{x}_0\|_2^2, \quad t_1 \in [t_n, t_f], \tag{20}
$$

where $\mathbf{x}_0$, $\mathbf{o}_1$ and $\mathbf{d}_1$ are known. As a result, our goal is to find the optimal $t_1$ that satisfies the following formulation:

$$
\|\mathbf{x}_1 - \mathbf{x}_0\|_2^2 \leq \delta, \tag{21}
$$

where $\delta \to 0$.

Assuming that $\mathbf{o}_0 = \{o_0^x, o_0^y, o_0^z\}$, $\mathbf{d}_0 = \{d_0^x, d_0^y, d_0^z\}$, $\mathbf{o}_1 = \{o_1^x, o_1^y, o_1^z\}$ and $\mathbf{d}_1 = \{d_1^x, d_1^y, d_1^z\}$, then Eq. 21 can be converted into the following formulation:

$$
\begin{aligned}
& \|\{o_0^x + t_0 d_0^x,\; o_0^y + t_0 d_0^y,\; o_0^z + t_0 d_0^z\} - \\
& \{o_1^x + t_1 d_1^x,\; o_1^y + t_1 d_1^y,\; o_1^z + t_1 d_1^z\}\|_2^2 \leq \delta.
\end{aligned} \tag{22}
$$

Simplifying the above formula, we obtain:

$$
t_1^{*} = \Phi(t_0), \tag{23}
$$

where

$$
\Phi(u) = \left((o_0^x + u d_0^x - o_1^x)^2 + (o_0^y + u d_0^y - o_1^y)^2 + (o_0^z + u d_0^z - o_1^z)^2\right)^{1/2}, \tag{24}
$$

which means that once the sampling points in view $I_0$ are known, all sampling points in view $I_1$ are determined. Fortunately, this is exactly the nature of multiplane images. When view $I_0$ is selected as the reference view to construct the MPI, all the sampling points of different views are deterministic and forced to lie on the same planes. As a result, consistency across different views can be well guaranteed. Experiments in Sec. 4.3 also support our analysis.
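The relation between Eq. 20 and Eqs. 23 and 24 can be checked numerically. The sketch below makes two assumptions that are not stated explicitly in the text: $\mathbf{d}_1$ is a unit vector, and the ray from $\mathbf{o}_1$ along $\mathbf{d}_1$ passes through $\mathbf{x}_0$. Under these assumptions the least-squares minimizer of Eq. 20 coincides with the distance $\Phi(t_0)$ from $\mathbf{o}_1$ to $\mathbf{x}_0$. The function names are hypothetical.

```python
import numpy as np

def phi(t0, o0, d0, o1):
    """Eq. 24: Euclidean distance from o1 to the point x0 = o0 + t0 * d0."""
    return float(np.linalg.norm(o0 + t0 * d0 - o1))

def closest_t1(x0, o1, d1, t_n, t_f):
    """Minimizer of ||o1 + t1 * d1 - x0||^2 over t1 (Eq. 20), clipped to [t_n, t_f]."""
    t1 = np.dot(d1, x0 - o1) / np.dot(d1, d1)
    return float(np.clip(t1, t_n, t_f))

# Toy configuration: two camera origins and one sample on the first ray.
o0 = np.array([0.0, 0.0, 0.0])
d0 = np.array([0.0, 0.0, 1.0])
o1 = np.array([3.0, 0.0, 0.0])
t0 = 4.0
x0 = o0 + t0 * d0
d1 = (x0 - o1) / np.linalg.norm(x0 - o1)   # unit direction from o1 through x0
t1_star = closest_t1(x0, o1, d1, 0.0, 10.0)
```

In this toy configuration both `phi(t0, o0, d0, o1)` and `t1_star` equal 5, and `o1 + t1_star * d1` recovers `x0`.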

4 Experiments

We compare our method with various state-of-the-art methods for few-shot novel view synthesis, both quantitatively and qualitatively. We also present a detailed analysis of the necessity of adopting per-view multiplane images and the appearance/depth consistency loss. See supplementary materials for demonstrations of Eq. 18 and Eq. 20, ablation studies on the influence of different numbers of MPI planes, and more visualization results of novel view synthesis. We evaluate our proposed method only on extremely sparse input views, i.e., 3 input views, as it is the most common case.

Figure 2: Qualitative comparisons on the Shiny dataset, where our proposed method can achieve better novel view synthesis and accurate geometry estimation (i.e., the depth map).
Figure 3: Qualitative comparisons on the LLFF dataset. Our proposed method can avoid the overfitting problem, where better novel view synthesis and more continuous depth estimation can be achieved.

4.1 Implementation Details

Datasets.

We perform experiments on the LLFF dataset [24] and the Shiny dataset [50] to validate the effectiveness of our proposed method. Both datasets contain 8 complex real-world scenes with large disparities, while the Shiny dataset is more challenging because it has more view-dependent effects such as reflections and refraction. We follow the experimental protocol of [28], where the resolution of both input views and target views is 378 $\times$ 504. For a fair comparison, as in previous methods, for each scene we hold out every 8th image as the test set and then select 3 images evenly from the remaining images as the input views. Notably, following [28], the sampled input views in our experiments are distributed uniformly in the camera pose space, so the distances between input views are almost the same. However, our proposed method can also be applied to scenarios where input views are randomly selected and exhibit greater spatial separation. This flexibility stems from constructing an individual MPI for each input view and from the capacity of each MPI to render novel views consistently. By requiring the novel views rendered by different MPIs to agree, regions of the novel view that overlap with the input views are effectively constrained, producing coherent and reasonable outcomes.

Training Details.

As discussed above, we construct a per-view MPI with 80 planes for each input view, where each MPI is modeled by an independent four- or six-layer LeakyReLU MLP with 256 nodes per layer. We set $L=10$ in the positional encoding $\gamma$ for the spatial coordinate $\mathbf{x}_i^k$ and apply no positional encoding to the direction vector $\mathbf{d}_i^{(u_t,v_t)}$. The initial learning rate is $5\times 10^{-3}$ and is gradually reduced to $1\times 10^{-4}$. At the beginning of the training process, we use only the reconstruction loss (Eq. 10) to train the network. After 15 epochs, the whole loss function (Eq. 13) with the appearance and depth consistency losses is used, where both $\lambda_{ac}$ and $\lambda_{dc}$ are set to 1. We train our models with the Adam optimizer, using batches of 1024 randomly sampled rays, within 50 epochs on a single NVIDIA RTX 3090 GPU. It takes about 2 hours to train a scene and 10 seconds to render a target view.
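The training schedule described above can be sketched as two small helper functions. Several details here are assumptions made only for illustration: the text does not specify the shape of the learning-rate decay, so an exponential interpolation is used; epochs are counted from 0, so the consistency terms switch on at epoch index 15; and training is assumed to span exactly 50 epochs. The function names are hypothetical.

```python
def learning_rate(epoch, total_epochs=50, lr_start=5e-3, lr_end=1e-4):
    """Assumed exponential decay from lr_start at epoch 0 to lr_end at the last epoch."""
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** frac

def loss_weights(epoch, warmup_epochs=15, lam_ac=1.0, lam_dc=1.0):
    """Return (lambda_ac, lambda_dc): zero during the reconstruction-only warm-up,
    then the full weights of Eq. 13."""
    if epoch < warmup_epochs:
        return 0.0, 0.0
    return lam_ac, lam_dc
```

A training loop would query both helpers once per epoch, passing the returned weights to the total loss of Eq. 13.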

Metrics.

We evaluate the quality of rendered novel view images with Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [48], and Learned Perceptual Image Patch Similarity (LPIPS) [60]. For easier comparison, similar to [28], we also report an average score for the LLFF dataset, computed as the geometric mean of $10^{-\text{PSNR}/10}$, $\sqrt{1-\text{SSIM}}$, and LPIPS.
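As a worked example, the average score can be computed as below. The function name `average_score` is hypothetical, and the example applies the formula to dataset-level means; if the score is instead averaged per scene, the values would generally differ slightly.

```python
import math

def average_score(psnr: float, ssim: float, lpips: float) -> float:
    """Geometric mean of 10^(-PSNR/10), sqrt(1 - SSIM), and LPIPS (lower is better)."""
    mse_term = 10.0 ** (-psnr / 10.0)
    ssim_term = math.sqrt(1.0 - ssim)
    return (mse_term * ssim_term * lpips) ** (1.0 / 3.0)

# Mean metrics reported for our method on LLFF: PSNR 19.45, SSIM 0.659, LPIPS 0.310.
print(round(average_score(19.45, 0.659, 0.310), 3))  # 0.127
```

All three terms decrease as quality improves, so a lower average score indicates better overall rendering quality.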

Table 2: Quantitative comparisons on the LLFF dataset. Our proposed method can achieve state-of-the-art performance. ft indicates the results fine-tuned on each scene individually.
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | Average↓ |
|---|---|---|---|---|
| NeRF [25] | 13.34 | 0.373 | 0.451 | 0.255 |
| NeX [50] | 17.36 | 0.591 | 0.369 | 0.163 |
| DietNeRF [16] | 14.94 | 0.370 | 0.496 | 0.232 |
| InfoNeRF [17] | 14.37 | 0.349 | 0.457 | 0.238 |
| PixelNeRF-ft [57] | 16.17 | 0.438 | 0.438 | 0.217 |
| SRF-ft [8] | 17.07 | 0.436 | 0.529 | 0.203 |
| MVSNeRF-ft [6] | 17.88 | 0.584 | 0.327 | 0.157 |
| GeCoNeRF [18] | 18.55 | 0.578 | 0.340 | 0.150 |
| RegNeRF [28] | 19.08 | 0.587 | 0.336 | 0.146 |
| MixNeRF [36] | 19.27 | 0.629 | 0.336 | 0.134 |
| FlipNeRF [35] | 19.34 | 0.631 | 0.335 | 0.133 |
| Ours | 19.45 | 0.659 | 0.310 | 0.127 |

4.2 Comparisons with State-of-the-art Methods

4.2.1 Results on the Shiny Dataset

We first compare our proposed method with vanilla NeRF [25], DietNeRF [16] and InfoNeRF [17] on the challenging Shiny dataset proposed by [50] to demonstrate the effectiveness of CMC. We choose 8 real-world scenes with complex view-dependent effects from the official shiny and shiny-extended datasets. As shown in Fig. 2, NeRF and DietNeRF overfit to the input views, and their estimated geometry (i.e., the depth map) is quite poor. InfoNeRF renders a more reasonable depth map, but it fails in more complicated scenes such as "Crest". Because DietNeRF relies only on a high-level semantic loss at the image level to enforce consistency across views, it generates repeated content in the rendered novel view image. InfoNeRF, in contrast, uses a ray entropy loss to regularize the seen and unseen views independently, so no cross-view interaction exists. As a result, these methods struggle to estimate reasonable content for occluded areas of the novel view that do not appear in the input views. Our proposed method, CMC, instead renders accurate depth maps and novel view images by fully utilizing cross-view consistency. As demonstrated in Tab. 1, CMC achieves state-of-the-art performance on all metrics. This suggests that physical priors alone are not strong enough to handle complex scenes under the few-shot setting, and that leveraging cross-view consistency helps obtain more accurate geometry estimation.

4.2.2 Results on the LLFF Dataset

Similar to many previous works, we also perform experiments on the widely used LLFF dataset against many state-of-the-art methods to demonstrate the superiority of our proposed method. Specifically, we compare our method with pretraining-based methods (i.e., PixelNeRF [57], SRF [8], MVSNeRF [6]), regularization-based methods (i.e., DietNeRF [16], InfoNeRF [17], RegNeRF [28]), a pseudo-view-based method (i.e., GeCoNeRF [18]), and NeRF [25].

As verified in Tab. 2, our method still achieves state-of-the-art performance, with a notable improvement in SSIM (0.659 versus 0.631 for the next-best method). Qualitative comparisons are shown in Fig. 3. Methods based on pre-trained networks, such as MVSNeRF, can avoid overfitting to input views to some extent, but their rendered novel view images contain unreasonable artifacts due to the domain gap between the training dataset and the test scenes. Moreover, for input views with large disparities, MVSNeRF still overfits and estimates wrong geometry, as demonstrated by the scene named "Horns". Regularization-based methods such as InfoNeRF produce severe artifacts in the generated novel view images. RegNeRF, a leading method for few-shot novel view synthesis, overcomes the overfitting problem to a large extent by means of depth smoothing regularization and a well-designed sampling annealing strategy. However, RegNeRF still generates some unreasonable geometry and produces discontinuous depth estimates, as demonstrated by the TV and conference table in the scene named "Room". In contrast, our proposed method achieves not only photorealistic novel view synthesis but also accurate and continuous depth estimation, without any physical priors serving as regularization terms and without any complex hand-crafted sampling strategy to avoid overfitting. In other words, our method realizes few-shot novel view synthesis with lower complexity, which is promising for many practical applications.

Figure 4: Qualitative comparisons of different choices of loss functions. (1) Single MPI with $\mathcal{L}_{\text{MSE}}$. (2) Per-view MPI with $\mathcal{L}_{\text{MSE}}$. (3) Per-view MPI with $\mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{dc}}^{\text{I}}$. (4) Per-view MPI with $\mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{dc}}^{\text{I}} + \mathcal{L}_{\text{ac}}$. (5) Per-view MPI with $\mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{dc}}^{\text{I}} + \mathcal{L}_{\text{ac}} + \mathcal{L}_{\text{dc}}$.

4.3 Ablation Studies

To verify the importance of constructing per-view MPI and the appearance/depth consistency loss, we perform ablation studies on the choices of loss functions. Specifically, we choose loss functions (Eq. 13) composed of different combinations of the reconstruction loss $\mathcal{L}_{\text{MSE}}$ (Eq. 10), the depth consistency loss on input views $\mathcal{L}_{\text{dc}}^{\text{I}}$ (Eq. 12), the appearance consistency loss on novel views $\mathcal{L}_{\text{ac}}$ (Eq. 11), and the depth consistency loss on novel views $\mathcal{L}_{\text{dc}}$ (Eq. 12).

Single MPI.

As shown in Fig. 4, in the single-MPI setting (Sing. MPI), only one randomly chosen input view is selected as the reference view, so only one MPI is constructed to render novel views. Although some artifacts exist, the neural network already avoids overfitting to the input views. In effect, a single MPI is a variant of NeRF whose only difference is that the sampling points along rays of different input views are constrained to lie on the same planes. Yet this slight change yields a PSNR improvement of roughly 4 dB over NeRF, as demonstrated in Tab. 2 and Tab. 3. This observation supports the effectiveness of our method: cross-view multiplane consistency substantially benefits accurate geometry recovery.
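The geometric idea of constraining sampling points to shared planes can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the planes are taken to be fronto-parallel in the reference camera frame, rays are assumed to be already expressed in that frame and not parallel to the planes, and the function names, the near/far bounds, and the disparity-uniform plane spacing are all hypothetical.

```python
import numpy as np

def plane_depths(near: float, far: float, num_planes: int = 80) -> np.ndarray:
    """Plane depths spaced uniformly in disparity (1/depth); the spacing is an assumption."""
    return 1.0 / np.linspace(1.0 / near, 1.0 / far, num_planes)

def sample_on_planes(origin: np.ndarray, direction: np.ndarray,
                     depths: np.ndarray) -> np.ndarray:
    """Intersect one ray with the planes z = depth in the reference camera frame.

    origin, direction: shape (3,), in reference-camera coordinates,
    with direction[2] != 0 so that the ray crosses every plane.
    Returns the intersection points, shape (num_planes, 3).
    """
    t = (depths - origin[2]) / direction[2]
    return origin[None, :] + t[:, None] * direction[None, :]

depths = plane_depths(near=1.0, far=10.0)
# Two rays from different camera positions share the same set of planes.
pts_a = sample_on_planes(np.array([0.0, 0.0, 0.0]), np.array([0.1, 0.0, 1.0]), depths)
pts_b = sample_on_planes(np.array([0.5, 0.0, 0.2]), np.array([-0.2, 0.1, 0.9]), depths)
print(np.allclose(pts_a[:, 2], depths), np.allclose(pts_b[:, 2], depths))
```

Because every ray is sampled only at its intersections with the same 80 planes, points queried while rendering one view lie on the same surfaces as points queried for any other view, in contrast to the per-ray depth sampling of standard NeRF.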

Per-view MPI.

To enhance interactions across different input views, we further propose per-view MPI, which treats every input view as the reference view and constructs a corresponding MPI for each. As shown in Fig. 4 and Tab. 3, even with only $\mathcal{L}_{\text{MSE}}$, per-view MPI improves rendering quality and produces more accurate geometry estimation, which demonstrates its effectiveness. When we then successively add $\mathcal{L}_{\text{dc}}^{\text{I}}$, $\mathcal{L}_{\text{ac}}$ and $\mathcal{L}_{\text{dc}}$ to the loss function, performance improves steadily, yielding more photorealistic novel view images and better depth estimation.
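To make the role of the consistency terms concrete, the sketch below penalizes disagreement among renders of the same view produced by different MPIs. It does not reproduce Eq. 11 or Eq. 12: the mean-squared deviation from the across-MPI mean is one plausible form chosen for illustration, and the function name and array layout are assumptions.

```python
import numpy as np

def consistency_loss(renders: np.ndarray) -> float:
    """Mean squared deviation of each MPI's render from the across-MPI mean.

    renders: shape (num_mpis, num_rays, channels), where channels is 3 for
    colors (appearance consistency) or 1 for depths (depth consistency).
    This form is an illustrative assumption, not the paper's exact loss.
    """
    mean_render = renders.mean(axis=0, keepdims=True)
    return float(np.mean((renders - mean_render) ** 2))

# Three MPIs rendering the same 4 rays: identical outputs incur no penalty.
colors = np.ones((3, 4, 3))
print(consistency_loss(colors))
```

The same function can be applied to rendered colors and to rendered depths, so any disagreement between MPIs on a shared view contributes a positive penalty.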

Table 3: Ablation studies on the choices of different loss functions.
Loss Sing. MPI Per-view MPI
$\mathcal{L}_{\text{MSE}}$
$\mathcal{L}_{\text{dc}}^{\text{I}}$
$\mathcal{L}_{\text{ac}}$
$\mathcal{L}_{\text{dc}}$
PSNR↑ 17.56 18.33 18.69 19.24 19.27 19.45
SSIM↑ 0.597 0.618 0.634 0.656 0.656 0.659
LPIPS↓ 0.359 0.345 0.334 0.336 0.321 0.310
Average↓ 0.158 0.146 0.139 0.133 0.130 0.129

5 Conclusions

We present a new technique for few-shot novel view synthesis based on cross-view multiplane consistency (CMC). We address the overfitting problem of few-shot view synthesis by forcing the sampling points to be identical when rendering different views through multiplane images. This is based on the assumption that, given sparse input view images, the sampling points on each ray are rarely used to render other views, which causes the neural network to memorize the input views rather than learn the underlying geometry. Then, to enhance interactions across different views, we construct per-view MPIs by treating every input view as the reference view, and apply appearance and depth consistency losses. We further provide an explanation for the overfitting problem and give the intuition behind CMC. To verify our assumption and method, we conduct experiments on complex real-world scenes from the LLFF and Shiny datasets, where CMC achieves state-of-the-art few-shot novel view synthesis without any scene priors or complicated hand-crafted sampling strategy.

6 Limitations and future works

Limitations.

The main limitation of our proposed method is that CMC does not perform well on surrounding scenes that contain large camera rotations, such as those in the Blender dataset [25]. This is because CMC is based on multiplane images (MPIs), which are designed for forward-facing scenes and are not suitable for representing surrounding scenes.

Future works.

Our future work includes extending CMC to surrounding scenes using representations such as multisphere images, where sampling points are forced to lie on the same spheres. Moreover, we will try to represent the scene with only one MLP, rather than a separate MLP for each per-view MPI, which would greatly reduce the training burden.

7 Acknowledgement

This work was supported in part by NSFC under Grants 62371434, U1908209, and 62021001.

References

  • [1] Y. C. Ahn, S. Jang, S. Park, J.-Y. Kim, and N. Kang. Panerf: Pseudo-view augmentation for improved neural radiance fields based on few-shot inputs. arXiv preprint arXiv:2211.12758, 2022.
  • [2] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pp. 233–242. PMLR, 2017.
  • [3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam’s razor. Information processing letters, 24(6):377–380, 1987.
  • [4] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen. Unstructured lumigraph rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 425–432, 2001.
  • [5] G. Chaurasia, S. Duchene, O. Sorkine-Hornung, and G. Drettakis. Depth synthesis and local warps for plausible image-based navigation. ACM Transactions on Graphics (TOG), 32(3):1–12, 2013.
  • [6] A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14124–14133, 2021.
  • [7] D. Chen, Y. Liu, L. Huang, B. Wang, and P. Pan. Geoaug: Data augmentation for few-shot nerf with geometry constraints. In European Conference on Computer Vision, pp. 322–337. Springer, 2022.
  • [8] J. Chibane, A. Bansal, V. Lazova, and G. Pons-Moll. Stereo radiance fields (srf): Learning view synthesis for sparse views of novel scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7911–7920, 2021.
  • [9] P. E. Debevec, C. J. Taylor, and J. Malik. Modeling and rendering architecture from photographs: A hybrid geometry-and image-based approach. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 11–20, 1996.
  • [10] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12882–12891, 2022.
  • [11] J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. Overbeck, N. Snavely, and R. Tucker. Deepview: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2367–2376, 2019.
  • [12] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deepstereo: Learning to predict new views from the world’s imagery. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5515–5524, 2016.
  • [13] S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5501–5510, 2022.
  • [14] Y. Han, R. Wang, and J. Yang. Single-view view synthesis in the wild with learned adaptive multiplane images. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–8, 2022.
  • [15] Y.-H. Huang, Y. He, Y.-J. Yuan, Y.-K. Lai, and L. Gao. Stylizednerf: consistent 3d scene stylization as stylized nerf via 2d-3d mutual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18342–18352, 2022.
  • [16] A. Jain, M. Tancik, and P. Abbeel. Putting nerf on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5885–5894, 2021.
  • [17] M. Kim, S. Seo, and B. Han. Infonerf: Ray entropy minimization for few-shot neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12912–12921, 2022.
  • [18] M. Kwak, J. Song, and S. Kim. Geconerf: Few-shot neural radiance fields via geometric consistency. arXiv preprint arXiv:2301.10941, 2023.
  • [19] H.-A. Le, T. Mensink, P. Das, and T. Gevers. Novel view synthesis from single images via point cloud transformation. arXiv preprint arXiv:2009.08321, 2020.
  • [20] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 31–42, 1996.
  • [21] J. Li, Z. Feng, Q. She, H. Ding, C. Wang, and G. H. Lee. Mine: Towards continuous depth mpi with nerf for novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12578–12588, 2021.
  • [22] Z. Li, W. Xian, A. Davis, and N. Snavely. Crowdsampling the plenoptic function. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pp. 178–196. Springer, 2020.
  • [23] Y. Liu, S. Peng, L. Liu, Q. Wang, P. Wang, C. Theobalt, X. Zhou, and W. Wang. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7824–7833, 2022.
  • [24] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
  • [25] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • [26] F. Mu, J. Wang, Y. Wu, and Y. Li. 3d photo stylization: Learning to generate stylized novel views from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16273–16282, 2022.
  • [27] T. Müller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
  • [28] M. Niemeyer, J. T. Barron, B. Mildenhall, M. S. Sajjadi, A. Geiger, and N. Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5480–5490, 2022.
  • [29] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5865–5874, 2021.
  • [30] A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10318–10327, 2021.
  • [31] C. Rasmussen and Z. Ghahramani. Occam’s razor. Advances in neural information processing systems, 13, 2000.
  • [32] G. Riegler and V. Koltun. Stable view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12216–12225, 2021.
  • [33] B. Roessle, J. T. Barron, B. Mildenhall, P. P. Srinivasan, and M. Nießner. Dense depth priors for neural radiance fields from sparse input views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12892–12901, 2022.
  • [34] J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4104–4113, 2016.
  • [35] S. Seo, Y. Chang, and N. Kwak. Flipnerf: Flipped reflection rays for few-shot novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22883–22893, 2023.
  • [36] S. Seo, D. Han, Y. Chang, and N. Kwak. Mixnerf: Modeling a ray with mixture density for novel view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20659–20668, 2023.
  • [37] S. Sinha, D. Steedly, and R. Szeliski. Piecewise planar stereo for image-based rendering. In 2009 International Conference on Computer Vision, pp. 1881–1888, 2009.
  • [38] N. Somraj, A. Karanayil, and R. Soundararajan. Simplenerf: Regularizing sparse input neural radiance fields with simpler solutions. In SIGGRAPH Asia 2023 Conference Papers, pp. 1–11, 2023.
  • [39] Z. Song, W. Chen, D. Campbell, and H. Li. Deep novel view synthesis from colored 3d point clouds. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pp. 1–17. Springer, 2020.
  • [40] P. P. Srinivasan, T. Wang, A. Sreelal, R. Ramamoorthi, and R. Ng. Learning to synthesize a 4d rgbd light field from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2243–2251, 2017.
  • [41] A. Trevithick and B. Yang. Grf: Learning a general radiance field for 3d representation and rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15182–15192, 2021.
  • [42] P. Truong, M. Danelljan, L. Van Gool, and R. Timofte. Learning accurate dense correspondences and when to trust them. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5714–5724, 2021.
  • [43] P. Truong, M.-J. Rakotosaona, F. Manhardt, and F. Tombari. Sparf: Neural radiance fields from sparse and noisy poses. arXiv preprint arXiv:2211.11738, 2022.
  • [44] R. Tucker and N. Snavely. Single-view view synthesis with multiplane images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 551–560, 2020.
  • [45] C. Wang, R. Jiang, M. Chai, M. He, D. Chen, and J. Liao. Nerf-art: Text-driven neural radiance fields stylization. arXiv preprint arXiv:2212.08070, 2022.
  • [46] D. Wang, X. Cui, S. Salcudean, and Z. J. Wang. Generalizable neural radiance fields for novel view synthesis with transformer. arXiv preprint arXiv:2206.05375, 2022.
  • [47] Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699, 2021.
  • [48] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [49] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson. Synsin: End-to-end view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7467–7477, 2020.
  • [50] S. Wizadwongsa, P. Phongthawee, J. Yenphraphai, and S. Suwajanakorn. Nex: Real-time view synthesis with neural basis expansion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8534–8543, 2021.
  • [51] D. N. Wood, D. I. Azuma, K. Aldinger, B. Curless, T. Duchamp, D. H. Salesin, and W. Stuetzle. Surface light fields for 3d photography. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 287–296, 2000.
  • [52] D. Xu, Y. Jiang, P. Wang, Z. Fan, H. Shi, and Z. Wang. Sinnerf: Training neural radiance fields on complex scenes from a single image. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 736–753. Springer, 2022.
  • [53] Q. Xu, Z. Xu, J. Philip, S. Bi, Z. Shu, K. Sunkavalli, and U. Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5438–5448, 2022.
  • [54] J. Yang, M. Pavone, and Y. Wang. Freenerf: Improving few-shot neural rendering with free frequency regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8254–8263, 2023.
  • [55] M. You, M. Guo, X. Lyu, H. Liu, and J. Hou. Learning a unified 3d point cloud for view synthesis. arXiv preprint arXiv:2209.05013, 2022.
  • [56] A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5752–5761, 2021.
  • [57] A. Yu, V. Ye, M. Tancik, and A. Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4578–4587, 2021.
  • [58] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  • [59] K. Zhang, G. Riegler, N. Snavely, and V. Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
  • [60] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595, 2018.
  • [61] X. Zhao, F. Ma, D. Güera, Z. Ren, A. G. Schwing, and A. Colburn. Generative multiplane images: Making a 2d gan 3d-aware. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part V, pp. 18–35. Springer, 2022.
  • [62] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely. Stereo magnification: learning view synthesis using multiplane images. ACM Transactions on Graphics (TOG), 37(4):1–12, 2018.
  • [63] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pp. 286–301. Springer, 2016.