Sparse Views, Near Light:
A Practical Paradigm for Uncalibrated Point-light Photometric Stereo

Mohammed Brahimi1,2    Bjoern Haefner1,2,4,*    Zhenzhang Ye1,2    Bastian Goldluecke3    Daniel Cremers1,2
1 TU Munich, 2 Munich Center for Machine Learning, 3 University of Konstanz, 4 NVIDIA
* The contribution was done while at TUM.
{mohammed.brahimi, bjoern.haefner, yez, cremers}@tum.de
[email protected]
Abstract

Neural approaches have shown significant progress on camera-based reconstruction. But they require either a fairly dense sampling of the viewing sphere, or pre-training on an existing dataset, thereby limiting their generalizability. In contrast, photometric stereo (PS) approaches have shown great potential for achieving high-quality reconstruction under sparse viewpoints. Yet, they are impractical because they typically require tedious laboratory conditions, are restricted to dark rooms, and are often multi-staged, making them subject to accumulated errors. To address these shortcomings, we propose an end-to-end uncalibrated multi-view PS framework for reconstructing high-resolution shapes acquired from sparse viewpoints in a real-world environment. We relax the dark room assumption, and allow a combination of static ambient lighting and dynamic near LED lighting, thereby enabling easy data capture outside the lab. Experimental validation confirms that our framework outperforms existing baseline approaches in the regime of sparse viewpoints by a large margin. This makes it possible to bring high-accuracy 3D reconstruction from the dark room to the real world, while maintaining a reasonable data capture complexity.

Abstract

In this supplementary material, we show further details about our framework. Specifically, we describe the network architecture with all its parameters and training specifications. Then we elaborate on the capturing process used to obtain the synthetic and real-world photometric images. After that, we show additional results on three viewpoints with a small camera baseline, as well as error maps of the reconstructions presented in the main paper. Furthermore, we show additional reconstruction results on both captured scans and the DiLiGenT-MV dataset, as well as some relighting results. We also analyze the effect of the ratio of point-light intensity on the reconstruction quality, as well as the effect of the number of viewpoints and lights. Finally, we elaborate on the limitations of our approach.

1 Introduction

[Figure 1 panels, left to right: Multi-illumination data at viewpoint 1 | Multi-illumination data at viewpoint 2 | PS-NeRF [66] | Ours]
Figure 1: We introduce the first framework for multi-view uncalibrated point-light photometric stereo. Given a set of PS images captured from different viewpoints (left), our method recovers a high-fidelity 3D reconstruction (right). The acquisition of uncalibrated point-light PS imagery captured under ambient lighting in a sparse multi-view setup not only allows for easy data capture, but also leads to 3D reconstructions of unprecedented detail. Here, with as few as two views we are able to reconstruct the squirrel’s 3D geometry at higher precision than the state-of-the-art.

The challenge of 3D reconstruction stands as a cornerstone in both computer vision and computer graphics. Despite notable progress in recovering an object’s shape from dense image viewpoints, predicting consistent geometry from sparse viewpoints remains a difficult task. Contemporary approaches employing neural data structures depend heavily on extensive training data to achieve generalization in the context of sparse views. Additionally, wide baselines and textureless objects pose significant obstacles. In contrast, photometric methodologies like photometric stereo (PS) excel in reconstructing geometry, even in textureless regions. This capability is attributed to the abundance of shading information derived from images acquired under diverse illumination. Yet, such approaches typically require a controlled laboratory setup to fulfill the necessary dark room and directional light assumptions. As a consequence, PS approaches become impractical beyond the confines of a laboratory.

To address these shortcomings we combine state-of-the-art volume rendering formulations with a sparse multi-view photometric stereo model. In particular, we advocate a physically realistic lighting model that combines ambient light and uncalibrated point-light illumination. Our approach facilitates simplified data acquisition, and we introduce a novel pipeline capable of reconstructing an object’s geometry from a sparse set of viewpoints, even if the object is completely textureless. Furthermore, since we assume a point-light model instead of distant directional lighting, we can infer absolute depth from a single viewpoint, allowing us to address the challenge of wide camera baselines.

In detail, our contributions are as follows:

  • We introduce the first framework for multi-view uncalibrated point-light photometric stereo. It combines a state-of-the-art volume rendering formulation with a physically realistic model of ambient light and point lights.

  • We eliminate the need for a dark room environment, dense capturing process, and distant lighting. Thereby we enhance the accessibility and simplify data acquisition for setups beyond traditional laboratory settings.

  • We validate that our framework can be successfully used for accurate shape reconstruction of textureless objects in highly sparse scenarios with wide baselines.

  • Despite the absence of pre-processing or vast training data, we outperform cutting-edge approaches that rely either on static ambient illumination only or on PS imagery.

2 Related Work

In this section, we categorize pertinent works into two distinct groups. The first category focuses on neural 3D reconstruction, emphasizing the use of images captured under static ambient illumination. Subsequently, the second category delves into photometric stereo (PS) approaches, which specifically involve methods assuming images captured under varying illumination conditions.

2.1 Neural 3D reconstruction

In our context, the domain of neural 3D reconstruction under static ambient light can be bifurcated into two distinct subcategories. The first comprises methodologies that presume a dense sampling of the viewing sphere [42, 69, 70, 58, 39, 43, 15, 74, 13, 5, 11]. Conversely, the second subcategory encompasses techniques adept at handling sparse camera viewpoints [67, 9, 17, 65, 73, 14, 59, 64, 68, 51, 71, 63, 35, 47, 23, 62, 72, 57, 7].

Given our emphasis on a sparse set of camera viewpoints, our focus will be directed toward the latter set of methodologies. The majority of sparse reconstruction approaches within this category hinge on training with a specific dataset [67, 9, 17, 65, 73, 14, 59, 64, 68, 51, 71, 63]. The reliance on extensive training data renders these techniques susceptible to errors when confronted with objects not represented in the training dataset, thereby introducing sensitivity to the domain gap between training and test data.

Efforts have been undertaken to mitigate the challenges associated with the domain gap. One approach involves fine-tuning pre-trained architectures during test time with sparse input data [35, 47, 23]. However, while fine-tuning enables fine-scale refinement of the predicted reconstruction, the broader issue of generalization to entirely different datasets persists as a challenge. Furthermore, it is crucial to note that all the previously mentioned sparse approaches presume small camera baselines and textured objects. To achieve a full reconstruction of an object from sparse views, it becomes imperative to develop texture-agnostic approaches capable of handling wide baselines.

Another set of approaches involves the utilization of pre-training to establish a geometric prior [62, 72, 57]. Yu et al. [72] leverage this geometric prior to predict a depth and normal map, albeit with a limitation in fine-scale details due to its coarse resolution. Notably, [62, 57] address this limitation. Wu et al. [62] relax the assumption on small baselines, although a reasonable overlap of sparse views remains necessary for consistent 3D reconstruction. Vora et al. [57] demonstrate promising results with much less overlap, employing only three viewpoints. However, the use of template shapes as a prior term introduces the risk of the reconstruction quality being affected by the domain gap.

In contrast, our approach eliminates the requirement for training data and operates effectively in wide baseline scenarios, independent of the object’s texture. This accomplishment can be attributed to the integration of PS principles, which we will elucidate in the subsequent discussion.

2.2 Photometric Stereo

The principle of PS [60] has a long and established history, renowned for yielding high-quality and finely detailed geometric estimates, due to its capturing process: multiple images under different illumination of a static object are captured from a single camera viewpoint.

In our case, PS approaches can thus be categorized into two main types. The first category pertains to single-view methods [55, 24, 44, 49, 32, 38, 50, 31, 34, 18], wherein data is acquired as described above. The second category involves multi-view methods [30, 78, 1, 10, 33, 22, 61, 45, 28, 27, 66], where the capturing process is repeated at multiple camera viewpoints.

Within the context of our task, we concentrate on the multi-view scenario. Many existing approaches in this domain assume knowledge of the light source for each image, earning them the designation of calibrated PS (CPS). In this case, frequently the light is parameterized as directional [30, 78, 1, 10], while approaches employing point-light illumination are more sparsely represented [30, 33, 10], owing to the intricate model and the much more complex calibration process [50]. In a broader context, the calibration of each image’s light source is impractical and tedious, underscoring the advantage of the uncalibrated case, which inherently facilitates a more straightforward and streamlined capturing process.

Efforts have been dedicated to exploring the uncalibrated PS (UPS) scenario, where there is no prior knowledge of the light source [22, 61, 45, 28, 27, 66]. Nevertheless, all these uncalibrated methods typically assume a distant and directional light model or permit a lighting model based on spherical harmonics [2, 4, 48, 19, 20, 53, 52], albeit at the expense of being restricted to Lambertian objects [61]. Consequently, the exploration of the uncalibrated point-light case remains uncharted territory, particularly in the multi-view scenario.

Beyond the impractical assumption of calibrated light or the constraint of uncalibrated directional lighting, the aforementioned multi-view photometric stereo approaches demand a dark room and lack accommodation for ambient light [30, 78, 1, 10, 33, 22, 45, 28, 27, 66]. Additionally, they comprise multiple stages, rendering them prone to initialization and accumulation errors [30, 33, 22, 61, 45, 28, 27, 66]. Furthermore, these methods assume Lambertian objects [33, 61] and crucially rely on a dense sampling of camera viewpoints around the object [30, 33, 22, 61, 45].

The approach most closely related to ours is the sparse method PS-NeRF [66], which assumes uncalibrated directional lighting with imagery captured in a dark room. This method comprises multiple stages, where initially, a pre-trained network [8] is utilized to estimate the initial light, along with normal information for each image and camera view. Subsequently, this acquired information is employed to estimate the object’s mesh through the optimization of an occupancy field [43]. The outputs of both stages are then combined to derive a consistent normal map of the object. Although PS-NeRF demonstrates impressive results in a sparse scenario with five viewpoints, its susceptibility to error accumulation is a potential limitation, given the involvement of multiple stages.

Our proposed model addresses the aforementioned limitations comprehensively. Notably, we excel in handling sparse yet wide baseline scenarios, where existing state-of-the-art methods encounter challenges. Our investigation tackles the open challenge of multi-view uncalibrated point-light PS within an end-to-end framework, liberating us from the constraints associated with Lambertian objects. Furthermore, our approach facilitates easy data capture in the presence of ambient light, eliminating the need for a dark room.

3 Setting and image formation model

We consider an object under static illumination photographed from several different viewpoints. For every viewpoint, we are given a set of $N+1$ images $(I^{l})_{l=0,\dots,N}$. Image $I^{0}$ is captured only under static ambient illumination and is called the ambient image. For images $I^{1}$ to $I^{N}$, the object is illuminated, in addition to the ambient illumination, by a single achromatic point-light source with scalar light intensity $L_{0}\in\mathbb{R}^{+}$, placed at a different location $\mathbf{p}^{l}\in\mathbb{R}^{3}$ for each image and viewpoint. We use the common assumption in volume rendering [70, 62, 39, 58] that the image brightness $I^{l}$ is the same as the accumulated radiance of the volume rendering, see Eq. 4. Note that we do not add an index for the viewpoint to the image in order not to clutter notation: whenever we later evaluate an image at a pixel location, we assume it implicitly means a pixel at a specific viewpoint. In particular, a sum over all pixels is the sum over all pixels at every viewpoint location.

According to the linearity of the radiance, the total radiance for the image $l$ is given by a sum $\mathcal{L}^{l}=\mathcal{L}_{\text{amb}}+\mathcal{L}_{\text{pt}}^{l}$, where $\mathcal{L}_{\text{amb}}$ is the radiance due to the static illumination, the same for each image, and $\mathcal{L}_{\text{pt}}^{l}$ is the radiance due to the point-light source used for that particular image. We will first express the radiance due to the point-light source with respect to the material and lighting parameters, which allows us to take into account the different illuminations of the images. Subsequently, we elaborate on how we deal with the ambient component.

3.1 Point-light illumination and material

The field $L(\mathbf{x},\mathbf{n},\mathbf{v})$ maps points $\mathbf{x}\in\mathbb{R}^{3}$, normal direction $\mathbf{n}(\mathbf{x})\in\mathbb{S}^{2}$ and viewing direction $\mathbf{v}\in\mathbb{S}^{2}$ to its radiance. It can be expressed in terms of geometry, material and lighting with the classical rendering equation [25]. Since for now we only consider a single achromatic point-light source with intensity $L_{0}$ at position $\mathbf{p}$, this equation simplifies to

$$L(\mathbf{x},\mathbf{n},\mathbf{v})=\dfrac{L_{0}}{\lVert\mathbf{p}-\mathbf{x}\rVert^{2}}\,f_{\text{r}}(\mathbf{x},\mathbf{n},\mathbf{v},\mathbf{l})\max(0,\mathbf{n}\cdot\mathbf{l}), \tag{1}$$

where $\mathbf{l}=\frac{\mathbf{p}-\mathbf{x}}{\lVert\mathbf{p}-\mathbf{x}\rVert}$.
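To make the inverse-square falloff and the clamped cosine term in Eq. (1) concrete, the following is a minimal sketch in plain Python. It assumes a Lambertian reflectance $f_{\text{r}}=\rho/\pi$ as a simple stand-in for the Disney BRDF used in the paper; the function name `point_light_radiance` and its arguments are hypothetical and chosen for illustration only.

```python
import math

def point_light_radiance(x, n, p, albedo=1.0, intensity=1.0):
    """Evaluate Eq. (1) for a single point light.

    x: surface point, n: unit surface normal, p: light position (3-tuples).
    Assumption: Lambertian BRDF f_r = albedo / pi instead of the Disney BRDF.
    """
    to_light = [pc - xc for pc, xc in zip(p, x)]
    dist = math.sqrt(sum(c * c for c in to_light))
    l = [c / dist for c in to_light]                       # unit light direction
    cos_term = max(0.0, sum(nc * lc for nc, lc in zip(n, l)))  # max(0, n . l)
    f_r = albedo / math.pi                                 # Lambertian stand-in
    return intensity / dist ** 2 * f_r * cos_term
```

Under this sketch, moving the light from 0.5 to 1.0 units above a surface point that faces it reduces the radiance by a factor of four, and a light placed behind the surface contributes nothing because of the clamped cosine.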

As a model for the spatially varying reflectance $f_{\text{r}}$, we use the simplified Disney BRDF [26]. It is relatively simple yet expressive enough to resemble a wide variety of materials. It has already been successfully used in several other prior works [36, 75, 5]. The Disney BRDF requires three parameters to be available at every point $\mathbf{x}\in\mathbb{R}^{3}$: an RGB diffuse albedo $\rho\in\mathbb{R}^{3}_{\geq 0}$, a roughness $\alpha_{\text{r}}>0$ and a specular albedo $\alpha_{\text{s}}\in[0,1]$.

As in [5], all three BRDF parameters can be represented with MLPs. One MLP with network parameters $\gamma_{1}$ computes the diffuse parameter $\rho$, a second MLP with network parameters $\gamma_{2}$ computes the specular parameters $(\alpha_{\text{r}},\alpha_{\text{s}})$. To encompass all point-light and BRDF parameters, we concatenate them into the vectors $\phi=(\mathbf{p}^{1},\dots,\mathbf{p}^{M})$, which includes all light source positions (where $M$ is the product of $N$ with the number of viewpoints), and $\gamma=(\gamma_{1},\gamma_{2})$, representing the BRDF parameters. Both $\phi$ and $\gamma$ serve as the network parameters to be learned during optimization. However, note that one of our contributions is a novel strategy to deal with the diffuse albedo later on in Sec. 5.2, which dramatically improves performance in extremely sparse scenarios.

We assume that the intensity $L_{0}$ remains constant across all images and views. This assumption is reasonable, e.g., when using a smartphone’s flashlight or a sole LED, as shown in our experiments. A constant intensity introduces a scale ambiguity between $f_{\text{r}}$ and $L_{0}$. We set $L_{0}\equiv 1$ and subsequently estimate a scaled version of the BRDF, i.e., the intensity is absorbed into the BRDF. This is feasible, since we can scale the Disney BRDF arbitrarily.

3.2 Ambient light

Given the provided ambient image, we adopt the approach from [77], which involves subtracting the ambient image from all other images. This process effectively eliminates ambient radiance from our considerations, simulating a traditional dark room scenario. A drawback is that in practice, the captured images contain additive noise, hence such a subtraction will double the variance of the noise distribution, resulting in noisier input images. Nevertheless, as can be seen in the results, our approach works successfully with this strategy even if we use a simple smartphone camera for the data capture.
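The variance statement can be made explicit with a short derivation. As an assumption for illustration, let every captured image carry independent, zero-mean additive noise $\varepsilon^{l}$ with variance $s^{2}$; the symbols $\varepsilon^{l}$ and $s$ are introduced here only for this sketch and are not part of the original notation.

```latex
\begin{align*}
I^{l} &= \mathcal{L}_{\text{amb}} + \mathcal{L}_{\text{pt}}^{l} + \varepsilon^{l},
\qquad
I^{0} = \mathcal{L}_{\text{amb}} + \varepsilon^{0}, \\
I^{l} - I^{0} &= \mathcal{L}_{\text{pt}}^{l} + \bigl(\varepsilon^{l} - \varepsilon^{0}\bigr), \\
\operatorname{Var}\bigl(\varepsilon^{l} - \varepsilon^{0}\bigr)
  &= \operatorname{Var}\bigl(\varepsilon^{l}\bigr) + \operatorname{Var}\bigl(\varepsilon^{0}\bigr)
   = 2s^{2}.
\end{align*}
```

Under these assumptions the ambient term cancels exactly, while the noise standard deviation of the difference image grows by a factor of $\sqrt{2}$.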

However, this simplification is enabled by the fact that the point-light sources are near the object, making it easy to obtain a well-exposed contribution from the point-light source. In methods assuming directional lighting, the practical simulation often involves using a point-light source placed far from the object, ensuring that the object’s diameter is significantly smaller than the distance between the point-light and the object. For example, for the DiLiGenT-MV dataset [30], the light sources are roughly 1m from the object, whereas for our setup, the distance is approximately 25cm. As a consequence, for a given static ambient illumination, the distant point-light will need to be $16\times$ brighter in the DiLiGenT-MV setup to obtain a similar signal-to-noise ratio, since the received light falls off with the squared distance and $(1\,\text{m}/0.25\,\text{m})^{2}=16$. This increased brightness requirement escalates the hardware demands, especially if the highest quality is sought.
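The brightness factor follows from the inverse-square term in Eq. (1). The snippet below is a small sketch of that arithmetic; the helper name `brightness_factor` is hypothetical, and the distances are the approximate values quoted in the text.

```python
def brightness_factor(far_distance_m, near_distance_m):
    """Factor by which a light at far_distance_m must be brighter to deliver
    the same irradiance as a light at near_distance_m (inverse-square law)."""
    return (far_distance_m / near_distance_m) ** 2

# Roughly 1 m (DiLiGenT-MV) versus roughly 0.25 m (near-light setup).
print(brightness_factor(1.0, 0.25))  # 16.0
```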

3.3 Other illumination effects

We do not model indirect lighting and cast-shadows explicitly. However, multiple previous works [3, 21, 36, 41, 76] and PS frameworks [50, 18, 30, 10] have demonstrated that satisfactory results can be achieved without taking these into account. In addition, we use the $L^{1}$-norm in our data term (7) to further increase robustness. Furthermore, we will present a simple strategy in Sec. 5 to reduce the impact of cast-shadows, which inevitably occur when capturing a nonconvex object under changing illumination.

4 Volume rendering of the surface

For differentiable rendering we build upon VolSDF [70] because in contrast to NeRF [39] it allows us to decouple geometric representation and appearance. As discussed in [62, 71, 51], VolSDF exhibits poor reconstruction quality in the sparse viewpoint scenario due to the shape-radiance ambiguity. We eliminate this ambiguity thanks to photometric stereo, providing shading cues from different illuminations for each viewpoint. To this end, we employ the radiance field in Eq. 1 and apply it to the multi-view uncalibrated point-light PS setting. The integration of differentiable volume rendering with a physically realistic lighting model provides an efficient solution to the multi-view photometric stereo problem, enabling high-quality reconstruction even with sparse viewpoints, as demonstrated in the subsequent results.

Geometry model and associated density. The geometry of the scene is modeled with a signed distance function (SDF) $d:\Omega\to\mathbb{R}$ within the volume $\Omega\subset\mathbb{R}^{3}$. We follow [46, 69, 58, 70] and represent the SDF with an MLP $d(\mathbf{x};\theta)$ parametrized by $\theta$. Note that we assume positive values of the SDF $d$ inside the object, thus the normal vector is given by $\mathbf{n}=\nabla d/\lVert\nabla d\rVert$ on $\Omega$. To render the scene, we follow [70] and transform the SDF into a density $\sigma:\Omega\to\mathbb{R}_{\geq 0}$ by means of the transformation

$$\sigma(\mathbf{x})=\beta^{-1}\Psi_{\beta}(-d(\mathbf{x})), \tag{2}$$

where $\Psi_{\beta}$ denotes the cumulative distribution function of the Laplace distribution with zero mean. The positive scale parameter $\beta$ is learned during optimization. For simplicity, we assume it is included in the set of SDF parameters $\theta$.
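As a sketch of how Eq. (2) can be evaluated, the snippet below implements the zero-mean Laplace CDF with scale $\beta$ and applies it to a scalar SDF value. The function names `laplace_cdf` and `density` are hypothetical, and a fixed $\beta$ is passed in by hand here, whereas in the paper it is learned.

```python
import math

def laplace_cdf(s, beta):
    """CDF of the zero-mean Laplace distribution with scale beta > 0."""
    if s <= 0.0:
        return 0.5 * math.exp(s / beta)
    return 1.0 - 0.5 * math.exp(-s / beta)

def density(d, beta):
    """Eq. (2): sigma = Psi_beta(-d) / beta for a scalar SDF value d."""
    return laplace_cdf(-d, beta) / beta
```

Evaluated this way, the density equals $1/(2\beta)$ on the zero level set and is bounded above by $1/\beta$, so a smaller $\beta$ concentrates the transition around the surface.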

Volume Rendering. Using the density (2), we can now set up a distribution of weights $w(t)$ along each ray $\mathbf{x}(t)=\mathbf{c}+t\mathbf{v},\ t\geq 0$, defined by the center of projection $\mathbf{c}$ of the camera and viewing direction $\mathbf{v}$,

$$w(t)=\sigma(\mathbf{x}(t))\,\exp\left(-\int_{0}^{t}\sigma(\mathbf{x}(s))\;\mathrm{d}s\right). \tag{3}$$

This is the derivative of the opacity function, which is monotonically increasing from zero to one, and thus $w$ is indeed a probability distribution [70]. In particular, the integral

$$\mathcal{L}_{p}=\int_{0}^{\infty}w(t)\,L(\mathbf{x}(t),\mathbf{n}(t),\mathbf{v})\;\mathrm{d}t, \tag{4}$$

to compute radiance $\mathcal{L}_{p}$ in the pixel $p\in\mathbb{R}^{2}$ corresponding to the ray $\mathbf{x}(t)$ is well-defined. Intuitively, we accumulate the radiance field (1) of the point-light source at locations where opacity increases, i.e. where we move towards the inside of the object. For an ideal sharp boundary, $w$ would be a Dirac distribution with its peak placed at the boundary.

As was done in [70] we approximate the integral (4) using the quadrature rule

$$\mathcal{L}_{p}\approx\sum^{m-1}_{i=1}(t_{i+1}-t_{i})\,w(t_{i})\,L(\mathbf{x}(t_{i}),\mathbf{n}(t_{i}),\mathbf{v}) \tag{5}$$

for numerical integration with a discrete set of samples $t_{1}<t_{2}<\dots<t_{m}$. Note that the integral in (3) required for $w(t_{i})$ in (5) can be numerically computed using an analogous formula while iteratively accumulating the sum (5).
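The quadrature in Eq. (5) can be sketched for a single ray as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the inner integral of Eq. (3) is approximated with a left Riemann sum, the density before the first sample $t_{1}$ is taken to be zero, and radiance is treated as a scalar. The name `render_ray` and the returned accumulated opacity are hypothetical.

```python
import math

def render_ray(ts, sigmas, radiances):
    """Approximate Eq. (5) along one ray.

    ts:        increasing sample depths t_1 < ... < t_m
    sigmas:    density sigma(x(t_i)) at each sample
    radiances: scalar radiance L(x(t_i), n(t_i), v) at each sample
    Returns (pixel radiance, accumulated opacity).
    """
    optical_depth = 0.0              # running estimate of the integral in Eq. (3)
    radiance, opacity = 0.0, 0.0
    for i in range(len(ts) - 1):
        dt = ts[i + 1] - ts[i]
        w = sigmas[i] * math.exp(-optical_depth)   # w(t_i) from Eq. (3)
        radiance += dt * w * radiances[i]          # one term of Eq. (5)
        opacity += dt * w
        optical_depth += sigmas[i] * dt            # left Riemann step
    return radiance, opacity
```

For a homogeneous medium of density $s$ and length $T$ with unit radiance, the result should approach $1-e^{-sT}$ as the sampling is refined, which offers a quick sanity check of the accumulation.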

5 Training objective and optimization

In order to reconstruct the scene by inverse rendering, we train our neural networks from the available input images. We optimize for the geometry parameters $\theta$ of the SDF network and point-light and BRDF parameters $\phi$ and $\gamma$ of the appearance network, respectively.

5.1 Vanilla loss function with MLP for albedo

The overall loss consists of the sum of an inverse rendering loss $E_{\text{RGB}}(\theta,\phi,\gamma)$, an Eikonal loss $E_{\text{eik}}(\theta)$ and a mask loss $E_{\text{mask}}(\theta)$ for the SDF,

$$E(\theta,\phi,\gamma)=E_{\text{RGB}}(\theta,\phi,\gamma)+\lambda_{1}E_{\text{eik}}(\theta)+\lambda_{2}E_{\text{mask}}(\theta), \tag{6}$$

with respective weighting parameters $\lambda_{1},\lambda_{2}\geq 0$ given as hyperparameters. We will describe the different terms in detail in the following.

The rendering loss

$$E_{\text{RGB}}(\theta,\phi,\gamma)=\sum_{p}\sum_{l\in B_{\tau}(p)}\left\lVert I^{l}_{p}-\mathcal{L}^{l}_{p}(\theta,\phi,\gamma)\right\rVert_{1} \tag{7}$$

encourages that the rendered images resemble the input images. As a strategy to deal with cast-shadows, for each pixel $p$, we use only the $\tau$ brightest views $B_{\tau}(p)\subset\{1\ldots N\}$, where $\tau>0$ is a hyper-parameter. This reduces the impact of cast-shadows, as a pixel will have the smallest possible brightness when it lies in shadow. Such a strategy is necessary since despite the use of an $L^{1}$-norm to improve robustness against outliers, some regions with frequent cast-shadows would still exhibit small artefacts, as can be seen in Fig. 2. In order to avoid wasting meaningful information due to this heuristic, we set $\tau=N$ for the first half of the iterations, and then set it to $\tau=\lceil\frac{3}{4}N\rceil$ for the second half for improved handling of cast-shadows.
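The truncation heuristic and its schedule can be sketched as below. Several details are assumptions made for illustration: brightness is ranked by the observed scalar intensity at the pixel, the switch happens exactly at the midpoint of training, and the names `tau_schedule` and `truncated_l1` are hypothetical.

```python
import math

def tau_schedule(iteration, total_iterations, num_lights):
    """tau = N during the first half of training, ceil(3N/4) afterwards."""
    if iteration < total_iterations // 2:
        return num_lights
    return math.ceil(3 * num_lights / 4)

def truncated_l1(observed, rendered, tau):
    """Per-pixel contribution to Eq. (7) using only the tau brightest images.

    observed, rendered: per-light scalar intensities at the same pixel.
    Assumption: brightness is ranked by the observed intensity.
    """
    order = sorted(range(len(observed)), key=lambda l: observed[l], reverse=True)
    return sum(abs(observed[l] - rendered[l]) for l in order[:tau])
```

With four lights and $\tau=3$, for example, the darkest observation at a pixel (the most likely shadow candidate) is dropped from the residual, while $\tau=4$ recovers the untruncated $L^{1}$ term.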

The Eikonal term

$$E_{\text{eik}}(\theta)=\sum_{\mathbf{x}}\left(\left\lVert\nabla_{\mathbf{x}}d(\mathbf{x};\theta)\right\rVert-1\right)^{2}\tag{8}$$

encourages $d(\mathbf{x};\theta)$ to approximate an SDF, as it penalizes deviations from the Eikonal equation $\lVert\nabla_{\mathbf{x}}d\rVert=1$, which every SDF satisfies, see e.g. [16].
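As a small numerical illustration of Eq. (8), the sketch below evaluates the penalty for an analytic sphere SDF. Central finite differences stand in for the automatic differentiation a neural SDF would use; that substitution, the sphere example, and all function names are assumptions made for this sketch.

```python
import math


def sphere_sdf(x, radius=1.0):
    """Exact signed distance to a sphere centred at the origin."""
    return math.sqrt(sum(c * c for c in x)) - radius


def grad_norm(d, x, h=1e-5):
    """Norm of the gradient of d at x, estimated by central differences."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((d(x_plus) - d(x_minus)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grad))


def eikonal_term(d, points):
    """Sum over sample points of (||grad d|| - 1)^2, as in Eq. (8)."""
    return sum((grad_norm(d, x) - 1.0) ** 2 for x in points)
```

A true SDF yields a penalty close to zero, whereas a rescaled function such as `lambda x: 2.0 * sphere_sdf(x)` has gradient norm 2 and is penalized by 1 per sample point.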

Finally, we follow [5, 58] and employ the mask loss

$$E_{\text{mask}}(\theta)=\sum_{p}\text{BCE}(M_{p},W_{p}(\theta)),\tag{9}$$

to impose silhouette consistency by means of the binary cross entropy (BCE) between the given binary mask $M_{p}$ at the pixel $p$ and the sum $W_{p}(\theta)=\sum_{i=1}^{m-1}w(t_{i};\theta)$ of the weights at the sampling locations $t_{i}$ used in (5).
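A minimal sketch of the mask loss in Eq. (9) follows. The per-ray weights are taken as given inputs, since the weight function of Eq. (5) is not reproduced here. The clamping constant `eps` and the function names are assumptions added for numerical safety and illustration.

```python
import math


def bce(mask_value, accumulated_weight, eps=1e-7):
    """Binary cross entropy between a binary mask value and W_p in [0, 1]."""
    # Clamp to avoid log(0); eps is an assumption, not taken from the paper.
    w = min(max(accumulated_weight, eps), 1.0 - eps)
    return -(mask_value * math.log(w) + (1 - mask_value) * math.log(1.0 - w))


def mask_loss(masks, ray_weights):
    """Sum over pixels of BCE(M_p, W_p), with W_p the sum of the ray weights."""
    total = 0.0
    for m, weights in zip(masks, ray_weights):
        total += bce(m, sum(weights))
    return total
```

A ray inside the silhouette whose weights sum to one, or a ray outside it whose weights sum to zero, contributes almost nothing; a foreground ray with accumulated weight 0.5 contributes $\log 2$.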

[Figure 2 panels, left to right: Input image | w/o truncation | w/ truncation]
Figure 2: Effect of cast-shadows on the result. The region behind the ear exhibits frequent cast-shadows in the input images (left), which induce artefacts in the estimated mesh (middle). Our truncation strategy (Sec. 5.1) successfully eliminates these (right).
[Figure 3 panels, left to right: Input image | [62] | [10] (directional light) | [10] (point-light) | [66] | Ours]
Figure 3: Full 3D reconstruction of one synthetic and two real objects from 6 viewpoints. While some of the baseline reconstructions (with directional or point-light models) exhibit texture-induced patterns, distortions or other artefacts, the proposed framework consistently yields artefact- and distortion-free reconstructions with finer geometric details. Additional results are shown in the supplementary.

5.2 Using the optimal diffuse albedo

In this section, we introduce a strategy to handle the diffuse albedo. This is inspired by [18]; however, their single-view approach can only handle the least-squares case, i.e. $L^{2}$, while we extend their formulation to objective functions in the form of an $L^{p}$-norm, with $p=1$ in our case, see Eq. 7. In lieu of employing an MLP for modeling the diffuse albedo, we opt for approximating an optimal solution given the remaining parameters. This allows us to exclude the diffuse albedo from the optimization process. We will show that this approach brings significant benefits in the context of sparse views, leading to markedly improved final geometry. A key technical change is the domain of the diffuse parameter $\rho$. So far, it has been a vector field defined in world space $\Omega$, implemented with a neural network. However, in the following we only try to recover the projection of $\rho$ on the image planes. Thus, the input to $\rho$ will be a pixel, and the output is the diffuse albedo at the intersection between the pixel's ray and the object surface. In particular, the set of BRDF parameters is reduced to $\gamma=\gamma_{2}$ and now only includes specular parameters.

Reformulation of the data term. We will now derive an expression for the data term which allows us to solve for the diffuse albedo $\rho$ when given all of the other parameters. Note that the SVBRDF can be decomposed into a diffuse part and a specular part [26],

$$f_{\text{r}}(\mathbf{x},\mathbf{n},\mathbf{v},\mathbf{l})=\dfrac{\rho(\mathbf{x})}{\pi}+f_{\text{s}}(\mathbf{x},\mathbf{n},\mathbf{v},\mathbf{l}),\tag{10}$$

where the specular part $f_{\text{s}}$ depends on both the roughness and specular albedo. If we denote the BRDF-independent factor of the radiance field (1) by

$$S^{l}(\mathbf{x},\mathbf{n})=\dfrac{L_{0}}{\left\lVert\mathbf{x}-\mathbf{p}^{l}\right\rVert^{2}}\max(0,\mathbf{n}\cdot\mathbf{l}),\tag{11}$$

this implies that the total radiance field is the sum of diffuse radiance $\frac{\rho}{\pi}S^{l}$ and specular radiance $L_{s}^{l}=f_{\text{s}}S^{l}$.
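The decomposition in Eqs. (10) and (11) can be sketched for a single surface point as follows. Two assumptions are made here: the light direction $\mathbf{l}$ is taken as the unit vector from $\mathbf{x}$ towards the light position $\mathbf{p}^{l}$, and the specular BRDF value `f_s` is passed in as a precomputed scalar. Function names are hypothetical.

```python
import math


def shading_factor(x, n, light_pos, L0=1.0):
    """BRDF-independent factor S^l(x, n) of Eq. (11) for one point light."""
    to_light = [p - c for p, c in zip(light_pos, x)]
    dist2 = sum(c * c for c in to_light)          # squared distance ||x - p^l||^2
    dist = math.sqrt(dist2)
    l_dir = [c / dist for c in to_light]          # assumed unit light direction
    cos_term = max(0.0, sum(a * b for a, b in zip(n, l_dir)))
    return L0 / dist2 * cos_term


def radiance(rho, f_s, S):
    """Per-channel radiance: diffuse part rho/pi * S plus specular part f_s * S."""
    return [r / math.pi * S + f_s * S for r in rho]
```

For a light two units above a surface point whose normal faces it, with $L_{0}=4$, the inverse-square falloff gives $S^{l}=1$; a light behind the surface gives $S^{l}=0$ because of the clamped cosine.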

Inserting this into the volume rendering equation (4), we can thus re-arrange the data term (7) as

$$\begin{aligned}E_{\text{RGB}}(\theta,\phi,\gamma,\rho)&=\sum_{p}\sum_{l\in B_{\tau}(p)}\left\lVert b^{l}_{p}(\theta,\phi,\gamma)-\rho_{p}\,a^{l}_{p}(\theta,\phi)\right\rVert_{1}\\ \text{with }a^{l}_{p}(\theta,\phi)&=\frac{1}{\pi}\int_{0}^{\infty}w_{\theta}(t)\,S^{l}(\mathbf{x}(t),\mathbf{n}_{\theta}(t))\;\mathrm{d}t,\\ b^{l}_{p}(\theta,\phi,\gamma)&=I^{l}_{p}-\int_{0}^{\infty}w_{\theta}(t)\,L_{s}^{l}(\mathbf{x}(t),\mathbf{n}_{\theta}(t),\mathbf{v})\;\mathrm{d}t.\end{aligned}\tag{12}$$

When keeping $\theta$, $\phi$, and $\gamma$ fixed, this expression can be minimized pixel-wise for $\rho$ as a linear least absolute deviation problem, as we show next.
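The rearrangement behind Eq. (12) can be checked numerically for one pixel, one light and one colour channel. The sketch replaces the integrals by sums over discrete ray samples with given weights, which is an assumption about the discretization, and treats the per-pixel albedo $\rho_{p}$ as constant along the ray. All names are hypothetical.

```python
import math


def diffuse_coefficient(weights, shading):
    """a_p^l: weighted sum of the shading factors S^l along the ray, over pi."""
    return sum(w * s for w, s in zip(weights, shading)) / math.pi


def specular_residual(intensity, weights, specular_radiance):
    """b_p^l: observed intensity minus the rendered specular radiance."""
    return intensity - sum(w * ls for w, ls in zip(weights, specular_radiance))


def rendered_intensity(rho, weights, shading, specular_radiance):
    """Volume-rendered radiance with a per-pixel albedo rho."""
    return sum(w * (rho / math.pi * s + ls)
               for w, s, ls in zip(weights, shading, specular_radiance))
```

With these definitions, the residual `b - rho * a` equals the original residual `I - rendered` for any `rho`, which is what makes the data term linear in the albedo.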

Alternate treatment of the diffuse albedo. It is now possible to completely exclude the diffuse albedo $\rho$ from the optimization and simply replace it by the optimal value

$$\rho^{*}_{\theta,\gamma}=\operatorname*{arg\,min}_{\rho}E_{\text{RGB}}(\theta,\phi,\gamma,\rho)\tag{13}$$

with respect to the other parameters. In our case, we are facing a linear least absolute deviation problem which can be solved individually for each pixel.

[Figure 4 rows: dog1, dog2. Panels, left to right: Input image | [62] | [66] | Ours]
Figure 4: Results for both dog1 and dog2 obtained with 5 viewpoints reveal a notable performance degradation in [62] and [66] when confronted with complex textures. This deterioration is attributed to a generalization failure of their learned priors. In contrast, our approach, which does not rely on any prior, provides a superior reconstruction that is unaffected by the complex texture.

We thus approximate the solution using a few iterations of iterative reweighted least squares [6], as it efficiently solves the problem while providing a differentiable expression. In order to simplify notation, we omit the dependency w.r.t. the parameters and denote only the channel-wise operations, where we iteratively compute

$$\rho^{*}_{p}\approx\rho^{(k)}_{p}=\dfrac{b_{p}^{\top}W^{(k)}_{p}a_{p}}{a_{p}^{\top}W^{(k)}_{p}a_{p}}\tag{14}$$

with diagonal matrices

$$W^{(k)}_{p}=\begin{cases}\operatorname{diag}\left(\max\left(\epsilon,\left|b_{p}-\rho^{(k-1)}_{p}a_{p}\right|\right)^{-1}\right)&\text{if }k\geq 1,\\ I&\text{otherwise.}\end{cases}\tag{15}$$

Above, $k$ is the iteration index, and $a_{p}$ and $b_{p}$ are the vectors whose components are given in Eq. 12. The small constant $\epsilon>0$ avoids numerical issues, and $I$ is an identity matrix of matching size. The gradient of $\rho^{(k)}_{p}$ with respect to $\theta$ and $\gamma$ is computed using automatic differentiation; however, the gradient of $W^{(k)}_{p}$ is not back-propagated, as doing so makes the optimization much more difficult and degrades the quality of the results.
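The update in Eqs. (14) and (15) can be sketched in plain Python for a single pixel and a single colour channel. The function name, the default number of iterations and the default `eps` are assumptions. In an autodiff framework the weights would additionally be treated as constants (no gradient), as described above.

```python
def irls_albedo(a, b, num_iters=5, eps=1e-6):
    """Approximate argmin_rho sum_l |b[l] - rho * a[l]| by IRLS.

    a and b hold the per-light coefficients a_p^l and b_p^l of one pixel.
    Assumes at least one a[l] is non-zero, otherwise the ratio is undefined.
    """
    weights = [1.0] * len(a)  # W^(0) = I
    rho = 0.0
    for _ in range(num_iters + 1):  # iterations k = 0, ..., num_iters
        # Eq. (14): weighted least-squares solution for the current weights.
        numerator = sum(w * b_l * a_l for w, a_l, b_l in zip(weights, a, b))
        denominator = sum(w * a_l * a_l for w, a_l in zip(weights, a))
        rho = numerator / denominator
        # Eq. (15): reweight with the inverse clamped absolute residuals.
        weights = [1.0 / max(eps, abs(b_l - rho * a_l)) for a_l, b_l in zip(a, b)]
    return rho
```

For `a = [1, 1, 1, 1, 1]` and `b = [2, 2, 2, 2, 10]`, the first iterate is the least-squares solution 3.6, which is pulled up by the outlier. Further iterations move towards 2, the least-absolute-deviation solution, which is the behaviour expected of an $L^{1}$ data term.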

In the following evaluation section, we demonstrate that this formulation delivers strong performance in 3D reconstruction from sparse viewpoints. It proves particularly advantageous in the sparsest scenario, where only two viewpoints are available, in which it outperforms the previous MLP-based formulation.

6 Results

| Scan | RMSE×100 (↓): [62] | [10]D | [10]P | [66] | Ours | MAE (↓): [62] | [10]D | [10]P | [66] | Ours |
| dog1 | 2.1 | 8.9 | 13.9 | 2.1 | 0.4 | 22.0 | 39.9 | 39.2 | 18.8 | 4.6 |
| dog2 | 2.2 | 8.4 | 13.4 | 2.1 | 0.4 | 23.2 | 29.6 | 40.3 | 19.5 | 4.8 |
| girl1 | 0.9 | 6.1 | 2.3 | 1.5 | 0.3 | 17.2 | 41.8 | 46.5 | 23.8 | 9.6 |
| girl2 | 1.5 | 5.2 | 4.8 | 1.5 | 0.3 | 23.9 | 40.0 | 47.3 | 23.4 | 10.2 |
Table 1: RMSE and MAE for full 3D reconstructions. RMSE is computed based on the vertex to mesh distance, and the MAE is computed using the angular error between the normals of a vertex and its closest point in the ground truth mesh. OurAlbedoNet leads to highly similar results as our main framework in this scenario.
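The two metrics can be sketched as follows. This is a simplified stand-in rather than the evaluation code behind Table 1: the closest ground-truth vertex replaces the closest point on the ground-truth mesh, and angular errors are reported in degrees, which is an assumption about the unit. Function names are hypothetical.

```python
import math


def angular_error_deg(n1, n2):
    """Angle in degrees between two (not necessarily unit) normal vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norms = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding errors
    return math.degrees(math.acos(cosine))


def evaluate(pred_points, pred_normals, gt_points, gt_normals):
    """Return (RMSE of distances, mean angular error) via nearest GT vertex."""
    sq_dists, angles = [], []
    for x, n in zip(pred_points, pred_normals):
        # Brute-force nearest neighbour; a spatial index would be used in practice.
        j = min(range(len(gt_points)), key=lambda i: math.dist(x, gt_points[i]))
        sq_dists.append(math.dist(x, gt_points[j]) ** 2)
        angles.append(angular_error_deg(n, gt_normals[j]))
    rmse = math.sqrt(sum(sq_dists) / len(sq_dists))
    mae = sum(angles) / len(angles)
    return rmse, mae
```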

[Figure 5 rows: squirrel, bird, dog2. Panels, left to right: Viewpoint 1 | Viewpoint 2 | [66] | OurAlbedoNet | Ours]
Figure 5: Reconstructions from two distant viewpoints. Severe artefacts can be noticed for both [66] and OurAlbedoNet. In particular, [66] introduces significant distortions in the shapes, as a result of using normal maps, which do not provide strong enough constraints on the absolute position of the points.

To substantiate the efficacy of our framework, we conduct evaluations on both synthetic and real-world datasets. We generate four synthetic scans by combining two distinct geometries with two different materials. The first material is white, rendering it textureless, while the second material exhibits a high degree of texture. This design enables us to quantify the influence of texture on the obtained results. The four synthetic scans are denoted as dog1, dog2, girl1, and girl2. Additionally, we acquire real-world data for six objects, namely, bird, squirrel, hawk, rooster, flamingo and pumpkin. We also consider the DiLiGenT-MV dataset [30].
Evaluation. We first evaluate full 3D reconstruction using only six viewpoints in total, except for dog1 and dog2 where five viewpoints are considered. We evaluate against both state-of-the-art sparse viewpoint reconstruction approaches assuming static illumination [35, 51, 71, 62] and photometric stereo approaches [10, 66]. Since [10] allows for directional and point-light illumination, we consider both cases in our evaluation: [10]D and [10]P refer to the directional and point-light version, respectively. Given the calibrated nature of their framework, we supply it with the lighting parameters estimated by our approach. Additionally, we explore a more challenging scenario in which only two viewpoints with a wide baseline are available. For this scenario, we introduce an ablated version of our framework, in which the diffuse albedo is modeled with an MLP. This allows us to highlight the advantages of employing the optimal diffuse albedo introduced in Sec. 5.2.
Full 3D reconstruction. As anticipated, [35, 51, 71] produce degenerate meshes in our wide baseline setup. Results in the supplementary demonstrate that they perform adequately when the distance between cameras is sufficiently small, although our framework consistently outperforms them. Conversely, Tab. 1 and Fig. 3 illustrate that our approach surpasses relevant baselines [62, 10, 66] both quantitatively and qualitatively. Our method yields sharper details without visible artefacts. Furthermore, error maps (shown in the supplementary material) reveal that the geometry obtained with [66] exhibits global distortions. Additionally, Fig. 4 demonstrates that both [62, 66] perform poorly when confronted with complex textures, while our results remain unaffected. This suggests that their pre-trained priors failed to generalize to the given texture, possibly misinterpreting it as geometric patterns. This shows the advantage of approaches that do not rely on learned priors. The results for the DiLiGenT-MV dataset [30] are shown in the supplementary, demonstrating a high accuracy in the classical scenario of a dark room and distant light sources.
Two viewpoints. We denote our ablated framework, which uses a diffuse albedo network, as OurAlbedoNet. We exclude [62] from consideration as they require at least three viewpoints. Fig. 5 illustrates the advantage of our framework in this challenging scenario, especially with the use of the optimal diffuse albedo. Notably, [66] and OurAlbedoNet exhibit pronounced artefacts and inaccuracies in geometric details in certain regions. Moreover, [66] results in severe distortions of the shape. We attribute this to their heavy reliance on per-view normal maps for reconstruction, which lack absolute depth information about the shape. In contrast, our method directly estimates the shape from various point-light illuminations, coupled with a suitable model (Eq. 1). This approach provides relevant cues about the absolute position of the points, contributing to more accurate and high-quality reconstructions.
Limitations and future work. A dedicated section is available in the supplementary.

7 Conclusion

We introduced the first framework for multi-view uncalibrated point-light photometric stereo. It combines a state-of-the-art volume rendering approach with a physically realistic illumination model consisting of an ambient component and uncalibrated point-lights. As demonstrated in a variety of experiments, the proposed approach offers a practical paradigm to create highly accurate 3D reconstructions from sparse and distant viewpoints, even outside a controlled dark room environment.
Acknowledgment This work was supported by the ERC Advanced Grant SIMULACRON, and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2117 – 422037984.

References

  • Asthana et al. [2022] Meghna Asthana, William Smith, and Patrik Huber. Neural apparent brdf fields for multiview photometric stereo. In Proceedings of the 19th ACM SIGGRAPH European Conference on Visual Media Production, pages 1–10, 2022.
  • Basri and Jacobs [2003] Ronen Basri and David W Jacobs. Lambertian reflectance and linear subspaces. IEEE transactions on pattern analysis and machine intelligence, 25(2):218–233, 2003.
  • Bi et al. [2020] Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Deep reflectance volumes: Relightable reconstructions from multi-view photometric images. In European Conference on Computer Vision, pages 294–311. Springer, 2020.
  • Brahimi et al. [2020] Mohammed Brahimi, Yvain Quéau, Bjoern Haefner, and Daniel Cremers. On the Well-Posedness of Uncalibrated Photometric Stereo Under General Lighting, pages 147–176. Springer International Publishing, Cham, 2020.
  • Brahimi et al. [2022] Mohammed Brahimi, Bjoern Haefner, Tarun Yenamandra, Bastian Goldluecke, and Daniel Cremers. Supervol: Super-resolution shape and reflectance estimation in inverse volume rendering. arXiv preprint arXiv:2212.04968, 2022.
  • Burrus [2012] C Sidney Burrus. Iterative reweighted least squares. OpenStax CNX. Available online: http://cnx.org/contents/92b90377-2b34-49e4-b26f-7fe572db78a1, 12, 2012.
  • Cerkezi and Favaro [2023] Llukman Cerkezi and Paolo Favaro. Sparse 3d reconstruction via object-centric ray sampling. arXiv preprint arXiv:2309.03008, 2023.
  • Chen et al. [2019] Guanying Chen, Kai Han, Boxin Shi, Yasuyuki Matsushita, and Kwan-Yee K Wong. Self-calibrating deep photometric stereo networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8739–8747, 2019.
  • Cheng et al. [2020] Shuo Cheng, Zexiang Xu, Shilin Zhu, Zhuwen Li, Li Erran Li, Ravi Ramamoorthi, and Hao Su. Deep stereo using adaptive thin volume representation with uncertainty awareness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2524–2534, 2020.
  • Cheng et al. [2022] Ziang Cheng, Hongdong Li, Richard Hartley, Yinqiang Zheng, and Imari Sato. Diffeomorphic neural surface parameterization for 3d and reflectance acquisition. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
  • Cheng et al. [2023] Ziang Cheng, Junxuan Li, and Hongdong Li. Wildlight: In-the-wild inverse rendering with a flashlight. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2023.
  • Community [2018] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.
  • Darmon et al. [2022] François Darmon, Bénédicte Bascle, Jean-Clément Devaux, Pascal Monasse, and Mathieu Aubry. Improving neural implicit surfaces geometry with patch warping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6260–6269, 2022.
  • Ding et al. [2022] Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao Liu. Transmvsnet: Global context-aware multi-view stereo network with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8585–8594, 2022.
  • Fu et al. [2022] Qiancheng Fu, Qingshan Xu, Yew Soon Ong, and Wenbing Tao. Geo-neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction. Advances in Neural Information Processing Systems, 35:3403–3416, 2022.
  • Gropp et al. [2020] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
  • Gu et al. [2020] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2020.
  • Guo et al. [2022] Heng Guo, Hiroaki Santo, Boxin Shi, and Yasuyuki Matsushita. Edge-preserving near-light photometric stereo with neural surfaces. arXiv preprint arXiv:2207.04622, 2022.
  • Haefner et al. [2019] B. Haefner, Z. Ye, M. Gao, T. Wu, Y. Quéau, and D. Cremers. Variational uncalibrated photometric stereo under general lighting. In IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 2019.
  • Haefner et al. [2020] Bjoern Haefner, Songyou Peng, Alok Verma, Yvain Quéau, and Daniel Cremers. Photometric depth super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2453–2464, 2020.
  • Haefner et al. [2021] Bjoern Haefner, Simon Green, Alan Oursland, Daniel Andersen, Michael Goesele, Daniel Cremers, Richard Newcombe, and Thomas Whelan. Recovering real-world reflectance properties and shading from hdr imagery. In 2021 International Conference on 3D Vision (3DV), pages 1075–1084. IEEE, 2021.
  • Hernandez et al. [2008] Carlos Hernandez, George Vogiatzis, and Roberto Cipolla. Multiview photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):548–554, 2008.
  • Huang et al. [2023] Shi-Sheng Huang, Zi-Xin Zou, Yi-Chi Zhang, and Hua Huang. Sc-neus: Consistent neural surface reconstruction from sparse and noisy views. arXiv preprint arXiv:2307.05892, 2023.
  • Ju et al. [2022] Yakun Ju, Kin-Man Lam, Wuyuan Xie, Huiyu Zhou, Junyu Dong, and Boxin Shi. Deep learning methods for calibrated photometric stereo and beyond: A survey. arXiv preprint arXiv:2212.08414, 2022.
  • Kajiya [1986] James T Kajiya. The rendering equation. In Proceedings of the 13th annual conference on Computer graphics and interactive techniques, pages 143–150, 1986.
  • Karis and Games [2013] Brian Karis and Epic Games. Real shading in unreal engine 4. Proc. Physically Based Shading Theory Practice, 4(3):1, 2013.
  • Kaya et al. [2022] Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, and Luc Van Gool. Uncertainty-aware deep multi-view photometric stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12601–12611, 2022.
  • Kaya et al. [2023] Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, and Luc Van Gool. Multi-view photometric stereo revisited. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3126–3135, 2023.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Li et al. [2020] Min Li, Zhenglong Zhou, Zhe Wu, Boxin Shi, Changyu Diao, and Ping Tan. Multi-view photometric stereo: A robust solution and benchmark dataset for spatially varying isotropic materials. IEEE Transactions on Image Processing, 29:4159–4173, 2020.
  • Logothetis et al. [2016] Fotios Logothetis, Roberto Mecca, Yvain Quéau, and Roberto Cipolla. Near-field photometric stereo in ambient light. 2016.
  • Logothetis et al. [2017] Fotios Logothetis, Roberto Mecca, and Roberto Cipolla. Semi-calibrated near field photometric stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 941–950, 2017.
  • Logothetis et al. [2019] Fotios Logothetis, Roberto Mecca, and Roberto Cipolla. A differential volumetric approach to multi-view photometric stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1052–1061, 2019.
  • Logothetis et al. [2023] Fotios Logothetis, Roberto Mecca, Ignas Budvytis, and Roberto Cipolla. A cnn based approach for the point-light photometric stereo problem. International Journal of Computer Vision, 131(1):101–120, 2023.
  • Long et al. [2022] Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. In European Conference on Computer Vision, pages 210–227. Springer, 2022.
  • Luan et al. [2021] Fujun Luan, Shuang Zhao, Kavita Bala, and Zhao Dong. Unified shape and svbrdf recovery using differentiable monte carlo rendering. In Computer Graphics Forum, pages 101–113. Wiley Online Library, 2021.
  • MATLAB [2020] MATLAB. version 9.8.0.1873465 (R2020a) Update 8. The MathWorks Inc., Natick, Massachusetts, 2020.
  • Mecca et al. [2014] Roberto Mecca, Aaron Wetzler, Alfred M Bruckstein, and Ron Kimmel. Near field photometric stereo with point light sources. SIAM Journal on Imaging Sciences, 7(4):2732–2770, 2014.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Nam et al. [2018] Giljoo Nam, Joo Ho Lee, Diego Gutierrez, and Min H Kim. Practical svbrdf acquisition of 3d objects with unstructured flash photography. ACM Transactions on Graphics (TOG), 37(6):1–12, 2018.
  • Niemeyer et al. [2020] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3504–3515, 2020.
  • Oechsle et al. [2021] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5569–5579. IEEE, 2021.
  • Papadhimitri and Favaro [2014] Thoma Papadhimitri and Paolo Favaro. Uncalibrated near-light photometric stereo. 2014.
  • Park et al. [2016] Jaesik Park, Sudipta N Sinha, Yasuyuki Matsushita, Yu-Wing Tai, and In So Kweon. Robust multiview photometric stereo using planar mesh parameterization. IEEE transactions on pattern analysis and machine intelligence, 39(8):1591–1604, 2016.
  • Park et al. [2019] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–174, 2019.
  • Peng et al. [2023] Rui Peng, Xiaodong Gu, Luyang Tang, Shihe Shen, Fanqi Yu, and Ronggang Wang. Gens: Generalizable neural surface reconstruction from multi-view images. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Peng et al. [2017] Songyou Peng, Bjoern Haefner, Yvain Queau, and Daniel Cremers. Depth super-resolution meets uncalibrated photometric stereo. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, 2017.
  • Quéau et al. [2017] Yvain Quéau, Tao Wu, and Daniel Cremers. Semi-calibrated near-light photometric stereo. In Scale Space and Variational Methods in Computer Vision: 6th International Conference, SSVM 2017, Kolding, Denmark, June 4-8, 2017, Proceedings 6, pages 656–668. Springer, 2017.
  • Quéau et al. [2018] Yvain Quéau, Bastien Durix, Tao Wu, Daniel Cremers, François Lauze, and Jean-Denis Durou. Led-based photometric stereo: Modeling, calibration and numerical solution. Journal of Mathematical Imaging and Vision, 60:313–340, 2018.
  • Ren et al. [2023] Yufan Ren, Tong Zhang, Marc Pollefeys, Sabine Süsstrunk, and Fangjinghua Wang. Volrecon: Volume rendering of signed ray distance functions for generalizable multi-view reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16685–16695, 2023.
  • Sang et al. [2020] Lu Sang, Bjoern Haefner, and Daniel Cremers. Inferring super-resolution depth from a moving light-source enhanced rgb-d sensor: A variational approach. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–10, 2020.
  • Sang et al. [2023] Lu Sang, Björn Häfner, Xingxing Zuo, and Daniel Cremers. High-quality rgb-d reconstruction via multi-view uncalibrated photometric stereo and gradient-sdf. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3105–3114, 2023.
  • Schönberger et al. [2016] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European conference on computer vision, pages 501–518. Springer, 2016.
  • Shi et al. [2016] Boxin Shi, Zhe Wu, Zhipeng Mo, Dinglong Duan, Sai-Kit Yeung, and Ping Tan. A benchmark dataset and evaluation for non-lambertian and uncalibrated photometric stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3707–3716, 2016.
  • Sumner [2014] Rob Sumner. Processing raw images in matlab. Department of Electrical Engineering, University of California Santa Cruz, 2014.
  • Vora et al. [2023] Aditya Vora, Akshay Gadi Patil, and Hao Zhang. Divinet: 3d reconstruction from disparate views via neural template regularization. arXiv preprint arXiv:2306.04699, 2023.
  • Wang et al. [2021] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021.
  • Wei et al. [2021] Zizhuang Wei, Qingtian Zhu, Chen Min, Yisong Chen, and Guoping Wang. Aa-rmvsnet: Adaptive aggregation recurrent multi-view stereo network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6187–6196, 2021.
  • Woodham [1980] Robert J Woodham. Photometric method for determining surface orientation from multiple images. Optical engineering, 19(1):139–144, 1980.
  • Wu et al. [2010] Chenglei Wu, Yebin Liu, Qionghai Dai, and Bennett Wilburn. Fusing multiview and photometric stereo for 3d reconstruction under uncalibrated illumination. IEEE transactions on visualization and computer graphics, 17(8):1082–1095, 2010.
  • Wu et al. [2023] Haoyu Wu, Alexandros Graikos, and Dimitris Samaras. S-volsdf: Sparse multi-view stereo regularization of neural implicit surfaces. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3556–3568, 2023.
  • Xu et al. [2023] Luoyuan Xu, Tao Guan, Yuesong Wang, Wenkai Liu, Zhaojie Zeng, Junle Wang, and Wei Yang. C2f2neus: Cascade cost frustum fusion for high fidelity and generalizable neural surface reconstruction. arXiv preprint arXiv:2306.10003, 2023.
  • Yan et al. [2020] Jianfeng Yan, Zizhuang Wei, Hongwei Yi, Mingyu Ding, Runze Zhang, Yisong Chen, Guoping Wang, and Yu-Wing Tai. Dense hybrid recurrent multi-view stereo net with dynamic consistency checking. In European conference on computer vision, pages 674–689. Springer, 2020.
  • Yang et al. [2020] Jiayu Yang, Wei Mao, Jose M Alvarez, and Miaomiao Liu. Cost volume pyramid based depth inference for multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4877–4886, 2020.
  • Yang et al. [2022] Wenqi Yang, Guanying Chen, Chaofeng Chen, Zhenfang Chen, and Kwan-Yee K Wong. Ps-nerf: Neural inverse rendering for multi-view photometric stereo. In European Conference on Computer Vision, pages 266–284. Springer, 2022.
  • Yao et al. [2018] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European conference on computer vision (ECCV), pages 767–783, 2018.
  • Yao et al. [2019] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5525–5534, 2019.
  • Yariv et al. [2020] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33:2492–2502, 2020.
  • Yariv et al. [2021] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • Liang et al. [2023] Yixun Liang, Hao He, and Ying-Cong Chen. Rethinking rendering in generalizable neural surface reconstruction: A learning-based solution. arXiv, 2023.
  • Yu et al. [2022] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in neural information processing systems, 35:25018–25032, 2022.
  • Zhang et al. [2020] Jingyang Zhang, Yao Yao, Shiwei Li, Zixin Luo, and Tian Fang. Visibility-aware multi-view stereo network. arXiv preprint arXiv:2008.07928, 2020.
  • Zhang et al. [2021a] Jingyang Zhang, Yao Yao, and Long Quan. Learning signed distance field for multi-view surface reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6525–6534, 2021a.
  • Zhang et al. [2021b] Kai Zhang, Fujun Luan, Qianqian Wang, Kavita Bala, and Noah Snavely. Physg: Inverse rendering with spherical gaussians for physics-based material editing and relighting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5453–5462, 2021b.
  • Zhang et al. [2022] Kai Zhang, Fujun Luan, Zhengqi Li, and Noah Snavely. Iron: Inverse rendering by optimizing neural sdfs and materials from photometric images. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
  • Zhang et al. [2012] Qing Zhang, Mao Ye, Ruigang Yang, Yasuyuki Matsushita, Bennett Wilburn, and Huimin Yu. Edge-preserving photometric stereo via depth fusion. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2479. IEEE, 2012.
  • Zhao et al. [2023] Dongxu Zhao, Daniel Lichy, Pierre-Nicolas Perrin, Jan-Michael Frahm, and Soumyadip Sengupta. Mvpsnet: Fast generalizable multi-view photometric stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12525–12536, 2023.
Sparse Views, Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo

Supplementary Material

Appendix A Network Details

A.1 Architecture

As mentioned in the main paper, we use two multilayer perceptrons (MLPs). The first describes the geometry via an SDF, $d_{\theta}$, and the second predicts the specular parameters of the material, $\alpha_{\gamma}$. The MLP $d_{\theta}$ consists of 6 layers of width 256, with a skip connection at the 4-th layer, while the MLP $\alpha_{\gamma}$ consists of 3 layers of width 256.
To compensate for the spectral bias of MLPs [40], the input of both $d_{\theta}$ and $\alpha_{\gamma}$ is encoded by positional encoding using 6 frequencies. For the ablation OurAlbedoNet, a third MLP describing the BRDF’s diffuse albedo, $\rho_{\gamma_{1}}$, is considered. It consists of 4 layers of width 512, and its input is encoded by positional encoding using 12 frequencies.
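As a rough illustration of what the positional encoding does to the input dimension, the following minimal sketch encodes a 3D point with 6 frequencies. The frequency schedule (powers of two), the concatenation of the raw input, and the function name `positional_encoding` are assumptions made for this example; the text above only fixes the number of frequencies.

```python
import math

def positional_encoding(x, num_freqs=6, include_input=True):
    """Encode each coordinate with sin/cos at frequencies 2^0 .. 2^(num_freqs-1).

    Assumptions (not stated in the text): frequencies double at each level,
    and the raw input is concatenated to the encoding.
    """
    features = list(x) if include_input else []
    for k in range(num_freqs):
        freq = 2.0 ** k
        features.extend(math.sin(freq * v) for v in x)
        features.extend(math.cos(freq * v) for v in x)
    return features

point = (0.1, -0.3, 0.5)
encoded = positional_encoding(point, num_freqs=6)
print(len(encoded))  # 3 raw coordinates + 3 * 2 * 6 encoded values = 39
```

Under these assumptions, a 3D input to $d_{\theta}$ grows to 39 features, and the 12-frequency encoding of OurAlbedoNet would grow it to 75.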

A.2 Parameters and Cost Function

Similarly to [5, 76, 69], we assume that the scene of interest lies within the unit sphere, which can be achieved by normalizing the camera positions appropriately. To approximate the volume rendering integral (4) using (5), we use $m = 98$ samples, which are also used to approximate (3), all with the sampling strategy of [70].
We set the trade-off parameters of the objective function to $\lambda_{1} = \lambda_{2} = 0.1$. Furthermore, the terms (7) and (8) of the objective function use batch sizes of 800 (inside the silhouette) and 1000, respectively. For the mask term (9), we use the same batch as for (7) and add 900 rays outside the silhouette that still intersect the unit sphere.
Finally, we always normalize each objective function’s summand with its corresponding batch size.
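The condition that a ray outside the silhouette still intersects the unit sphere can be checked in closed form. The sketch below is one possible way to do it and is not taken from our implementation; the function name `intersects_unit_sphere` is hypothetical. With such a filter, the mask batch would contain 800 + 900 = 1700 rays.

```python
import math

def intersects_unit_sphere(origin, direction):
    """Return True if the ray origin + t * direction (t >= 0) hits the unit sphere."""
    # Normalize the direction so the quadratic in t has leading coefficient 1.
    norm = math.sqrt(sum(d * d for d in direction))
    d = [c / norm for c in direction]
    # Solve |o + t d|^2 = 1, i.e. t^2 + 2 b t + c = 0.
    b = sum(o * di for o, di in zip(origin, d))
    c = sum(o * o for o in origin) - 1.0
    disc = b * b - c
    if disc < 0.0:
        return False  # the supporting line misses the sphere entirely
    t_far = -b + math.sqrt(disc)
    return t_far >= 0.0  # at least one intersection lies in front of the origin

camera = (0.0, 0.0, -3.0)
print(intersects_unit_sphere(camera, (0.0, 0.0, 1.0)))  # ray pointing at the sphere
print(intersects_unit_sphere(camera, (0.0, 1.0, 0.0)))  # ray pointing sideways
```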

A.3 Training

Our networks are trained using the Adam optimizer [29] with a learning rate initialized to $5\text{e}{-4}$ and decayed exponentially during training to $5\text{e}{-5}$, except for the MLP $\alpha_{\gamma}$, whose learning rate is kept constant at $1\text{e}{-5}$. The light positions $\phi$ are initialized with the camera position of their corresponding viewpoint, with a learning rate initialized to $1\text{e}{-2}$ and decayed exponentially at the same rate as for the networks. The remaining parameters are kept at PyTorch’s defaults.
We train for 800 epochs, which takes about 6 hours for 6 viewpoints on a single NVIDIA Titan GTX GPU with 12 GB of memory.
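To make the schedule concrete, the following sketch evaluates an exponential decay between the stated start and end values. It assumes that the decay is a pure exponential applied per epoch over the 800 epochs; the granularity (per epoch or per iteration) is not specified above, and `exponential_lr` is a hypothetical helper.

```python
def exponential_lr(epoch, lr_start, lr_end, total_epochs=800):
    """Exponentially interpolate from lr_start (epoch 0) to lr_end (last epoch)."""
    return lr_start * (lr_end / lr_start) ** (epoch / total_epochs)

# Networks: 5e-4 -> 5e-5, i.e. a total decay factor of 0.1 over training.
decay_factor = 5e-5 / 5e-4
for epoch in (0, 400, 800):
    lr_net = exponential_lr(epoch, 5e-4, 5e-5)
    # Light positions start at 1e-2 and share the same decay rate.
    lr_light = exponential_lr(epoch, 1e-2, 1e-2 * decay_factor)
    print(epoch, lr_net, lr_light)
```

Under this reading, the learning rate of the light positions would end at about $1\text{e}{-3}$, while that of $\alpha_{\gamma}$ stays at $1\text{e}{-5}$ throughout.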

Appendix B Data Acquisition

In this section, we describe how we generated the datasets used in this paper.

B.1 Synthetic Data

The synthetic datasets dog1, dog2, girl1 and girl2 were generated using Blender [12] and Matlab [37]: Blender is used to render normal, depth and BRDF parameter maps for each viewpoint, and Matlab is used to render the images using equation (1) of the main paper. We used 20 point light illuminations for each viewpoint, with a point light intensity ratio of 70% (thus 30% of ambient light), and added zero-mean Gaussian noise with standard deviation $\sigma = 0.02$.
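The mixing of point light, ambient light and noise can be sketched per pixel as follows. This is an illustration only: it assumes that the ratio acts as a convex weight between a normalized point light radiance and a normalized ambient radiance, which is one plausible reading of the setup, and the function `synthesize_pixel` is hypothetical rather than part of our Matlab pipeline.

```python
import random

def synthesize_pixel(point_radiance, ambient_radiance, point_ratio=0.7,
                     sigma=0.02, rng=random):
    """Mix point-light and ambient radiance, then add zero-mean Gaussian noise.

    Assumption: point_ratio weights the two (normalized) radiances convexly.
    """
    clean = point_ratio * point_radiance + (1.0 - point_ratio) * ambient_radiance
    return clean + rng.gauss(0.0, sigma)

rng = random.Random(0)  # fixed seed so the example is reproducible
noisy = synthesize_pixel(1.0, 0.5, point_ratio=0.7, sigma=0.02, rng=rng)
print(noisy)  # close to 0.7 * 1.0 + 0.3 * 0.5 = 0.85, up to noise
```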

B.2 Real World Data

To generate the real-world datasets squirrel, bird, hawk, rooster, flamingo and pumpkin, we used a Samsung Galaxy Note 8 and the application "CameraProfessional" (https://play.google.com/store/apps/details?id=com.azheng.camera.professional) to capture RAW images and the smartphone’s own images in parallel. We use the RAW images for our algorithm, and we pre-processed them using Matlab [37] by following [56]. Since our approach assumes very precise camera parameters, and in order to facilitate calibration, we captured a larger number of viewpoints and used COLMAP [54] on the smartphone’s images to obtain both camera poses and intrinsics.
We move a hand-held LED (a white LUXEON Rebel LED: https://luxeonstar.com/product-category/led-modules/) to obtain 20 images with different point light illumination per viewpoint.

Appendix C Small camera baseline

As mentioned in the main paper, [35, 51, 71] lead to degenerate meshes when considering distant cameras, and are thus not suited to reconstruct full 3D objects from sparse viewpoints. For a fairer comparison, we focus here on a different scenario, where only a part of the object is reconstructed from three viewpoints with a very small camera baseline. Fig. 1 clearly indicates that our approach allows for much more accurate and complete 3D reconstruction than [35, 51, 71]. Note that all the meshes were obtained solely by using the official implementations.

(Figure 1 panels. Rows: girl1, girl2, dog1, dog2. Columns: Input viewpoints, [35], [51], [71], Ours.)
Figure 1: Results using three viewpoints with small camera baseline.

Appendix D Error maps

For a better appreciation of the quality of the full 3D reconstructions shown in the main paper, we show both the vertex-to-mesh distance and angular error maps in Fig. 2 and Fig. 3, respectively. We can see that our approach performs much better than the baseline at both the coarse and fine levels. Hence, it produces reconstructions that are not only visually more pleasing, as can be seen in the main paper, but also of much higher fidelity.

(Figure 2 panels. Rows: girl1, girl2, dog1, dog2. Columns: [62], [10] (directional light), [10] (point-light), [66], Ours. Color bar ticks: 0, 0.025, 0.05.)
Figure 2: Vertex-to-mesh distance error maps. Errors are truncated for better visibility.

(Figure 3 panels. Rows: girl1, girl2, dog1, dog2. Columns: [62], [10] (directional light), [10] (point-light), [66], Ours. Color bar ticks: 0, 45, 90.)
Figure 3: Angular error maps. Note that for girl1 and girl2, a region of the plate at the bottom has significant errors for all approaches. This is due to the fact that in the ground truth mesh, vertices are only on the edges at that region, thus the angular error is not accurate there.

Appendix E Additional results

Fig. 4 shows the reconstruction results for our real-world scans that were not shown in the main paper. In order to further assess the quality of our framework on diverse materials, we performed an evaluation on the DiLiGenT-MV dataset [30]. Although this dataset was captured with distant light sources, thereby satisfying the directional lighting assumption used in the baseline, our framework still achieves the best results both quantitatively and qualitatively, as can be seen in Tab. 2 and Fig. 5, respectively. Finally, we also show some relighting results in Fig. 7, together with the optimal diffuse albedo. This shows the validity of the estimated material parameters, which can be successfully used for relighting, and indicates a proper disentanglement of the scene in terms of shape and material.

(Figure 4 panels. Columns: Input image, [62], [10] (directional light), [10] (point-light), [66], Ours.)
Figure 4: Full 3D reconstruction of real objects from 6 viewpoints.
(Figure 5 panels. Columns: [10], [66], Ours, Ground truth.)
Figure 5: Results on the DiLiGenT-MV dataset [30] using 6 viewpoints.
           MAE ↓                  RMSE ×100 ↓
           [10]    [66]    Ours   [10]    [66]    Ours
bear       19.1     8.8     3.5    2.2     1.0     0.6
buddha     39.6    13.9    10.8    2.5     0.7     0.5
pot2       29.2     9.2     5.1   10.2     0.7     0.5
reading    34.7    11.9     7.1   10.4     1.1     0.9
cow        25.3     8.9     3.6    6.1     0.9     0.5
Table 2: MAE and RMSE for the DiLiGenT-MV dataset [30].

Appendix F Effect of the ratio of point light

We further analyze the effect of the ratio of point light intensity on the quality of the result. This tells us how much ambient light our approach can handle while still providing accurate reconstructions. Recall that the total radiance can be decomposed into the sum of the point light radiance and the ambient light radiance, and that we obtain the point light images by subtracting the ambient image from the input images. As discussed in Section 3.2 of the main paper, one key issue with this strategy is that decreasing the point light intensity yields point light images with a worse signal-to-noise ratio, which inevitably affects the quality of the result. Consequently, we evaluate our approach on dog2 using the same five viewpoints as in the main paper to obtain a full 3D reconstruction, with a point light intensity ratio ranging from 10% to 100% (dark room). We also consider two levels of noise, with standard deviations $\sigma \in \{0.02, 0.04\}$. As shown in Tab. 3 and Fig. 6, the quality of the result indeed improves as expected when increasing the point light intensity. Moreover, for a given desired accuracy, a higher ratio of point light is required for a noisier sensor, since the latter yields a worse signal-to-noise ratio for the point light images. Nevertheless, even with a significant amount of noise, a reasonable result can be obtained starting from 40% of point light intensity, and a high accuracy with 70% and above. As mentioned in Section 3.2 of the main paper, satisfying those requirements in practice is greatly facilitated by the fact that near point lights are handled properly by our framework, in contrast to the majority of photometric stereo frameworks, which require distant lights.
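The dependence on the ratio and on the noise level can be made explicit with a back-of-the-envelope model. The following is a sketch under assumptions that are not part of the paper: $r$ denotes the point light ratio, $P_i$ and $A$ are normalized point light and ambient radiances, and every captured image, including the ambient one, carries independent Gaussian noise of standard deviation $\sigma$.

```latex
\begin{align}
I_i &= r\,P_i + (1-r)\,A + n_i, &
I_{\mathrm{amb}} &= (1-r)\,A + n_0, &
n_i, n_0 &\sim \mathcal{N}(0, \sigma^2), \\
I_i - I_{\mathrm{amb}} &= r\,P_i + (n_i - n_0), &
\operatorname{Var}(n_i - n_0) &= 2\sigma^2, &
\mathrm{SNR}_i &\propto \frac{r\,P_i}{\sqrt{2}\,\sigma}.
\end{align}
```

In this simplified model the signal-to-noise ratio of the point light images depends only on $r/\sigma$, so doubling the noise would have to be compensated by doubling the ratio. This is at least compatible with Tab. 3, where for instance 10% at $\sigma = 0.02$ and 20% at $\sigma = 0.04$ both give an RMSE$\times 1000$ of 10.3.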

RMSE ×1000 ↓
                    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
$\sigma = 0.02$    10.3    5.6    4.3    3.9    3.3    3.4    3.2    3.2    2.9    2.6
$\sigma = 0.04$    16.1   10.3    6.5    5.2    4.5    3.9    3.6    3.6    3.5    3.2

MAE ↓
                    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
$\sigma = 0.02$    12.7    7.8    6.3    5.7    5.2    5.0    4.7    4.8    4.6    4.4
$\sigma = 0.04$    16.7   12.8    9.4    7.8    6.9    6.2    5.8    5.5    5.4    5.0
Table 3: RMSE and MAE for different ratios of point light intensity, and two different levels of noise. RMSE is computed based on the vertex-to-mesh distance, and the MAE is computed using the angular error between the normals of a vertex and its closest point in the ground truth mesh.
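For readers who want to reproduce these two metrics, a minimal sketch is given below. It assumes that the vertex-to-mesh distances and the closest-point normal correspondences have already been computed by a mesh library, and that the angular error is reported in degrees; the function names are hypothetical.

```python
import math

def rmse(distances):
    """Root-mean-square of per-vertex distances to the ground-truth mesh."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def angular_error_deg(n1, n2):
    """Angle in degrees between two (not necessarily unit-length) normals."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm1 = math.sqrt(sum(a * a for a in n1))
    norm2 = math.sqrt(sum(b * b for b in n2))
    cos_angle = max(-1.0, min(1.0, dot / (norm1 * norm2)))  # guard against rounding
    return math.degrees(math.acos(cos_angle))

def mean_angular_error(pred_normals, gt_normals):
    """Average angular error over corresponding normal pairs."""
    errors = [angular_error_deg(p, g) for p, g in zip(pred_normals, gt_normals)]
    return sum(errors) / len(errors)

pred = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(mean_angular_error(pred, gt))  # mean of a 0 degree and a 90 degree error
print(rmse([0.003, 0.004]))
```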

(Figure 6 panels. Rows: $\sigma = 0.02$, $\sigma = 0.04$. Columns: 20%, 40%, 70%, 100%. An additional panel shows the ground truth.)
Figure 6: Results on dog2 for different ratios of point light intensity. The first and second rows correspond to the results with Gaussian noise of standard deviation $\sigma = 0.02$ and $0.04$, respectively.
(Figure 7 panels. Columns: Ground truth, Our rendering, Optimal albedo.)
Figure 7: Relighting results.
(a) dog2
         5L     10L    20L
4V       7.7    6.1    5.5
6V       6.4    5.0    4.6
8V       6.0    4.6    4.3
12V      5.1    4.2    3.9

(b) girl2
         5L     10L    20L
4V      13.7   12.4   12.4
6V      12.1   10.7   10.6
8V      11.3   10.3   10.1
12V     10.1    9.4    9.4
Figure 8: MAE for different numbers of viewpoints (V) and lights (L).

Appendix G Effect of the number of viewpoints and lights

Fig. 8 shows the MAE for both dog2 and girl2 using different numbers of viewpoints and light sources. Four viewpoints and five light sources already suffice for a decent full 3D reconstruction, and six viewpoints and ten light sources are enough for a high-quality result.
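To relate these numbers to capture effort, the short sketch below lists the number of point light images for each configuration next to the dog2 MAE values copied from Fig. 8(a). It assumes one image per light and viewpoint and does not count ambient images; `num_images` is a hypothetical helper.

```python
# MAE on dog2 copied from Fig. 8(a): outer keys are viewpoints, inner keys are lights.
mae_dog2 = {
    4: {5: 7.7, 10: 6.1, 20: 5.5},
    6: {5: 6.4, 10: 5.0, 20: 4.6},
    8: {5: 6.0, 10: 4.6, 20: 4.3},
    12: {5: 5.1, 10: 4.2, 20: 3.9},
}

def num_images(viewpoints, lights):
    """Point light images to capture, assuming one image per light and viewpoint."""
    return viewpoints * lights

for viewpoints, row in mae_dog2.items():
    for lights, mae in row.items():
        print(f"{viewpoints}V x {lights}L: {num_images(viewpoints, lights)} images, MAE {mae}")
```

For example, the 6V/10L setting requires 60 point light images, compared with 240 for 12V/20L.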

Appendix H Limitations

The optimal diffuse albedo yields great 3D reconstruction results even in the sparsest scenarios. However, it is only defined for the viewpoints used for training and is not multiview consistent, which hinders novel view synthesis from arbitrary viewpoints. Our ablation OurAlbedoNet mitigates this issue by using a neural diffuse albedo, at the cost of failing in some highly sparse scenarios. A straightforward solution would be to first use the optimal diffuse albedo strategy, then fix the geometry and specular parameters, and learn a neural diffuse albedo in a second stage. Successfully achieving a multiview consistent diffuse albedo in the sparsest scenarios without relying on a second stage might increase the overall robustness, and is left for future work. Moreover, we presume the availability of camera poses, acknowledging the challenge of pose estimation, particularly in the context of sparse viewpoints. A valuable extension of our work could be to relax this assumption, e.g., based on [23]. Finally, our BRDF choice is limited to opaque, non-metallic objects. Extending our framework beyond those materials is an intriguing avenue for future exploration.