
¹ Toshiba Europe, [email protected]
² University of Cambridge, [email protected]

NPLMV-PS: Neural Point-Light Multi-View Photometric Stereo

Fotios Logothetis¹, Ignas Budvytis¹, Roberto Cipolla¹,²
Abstract

In this work we present a novel multi-view photometric stereo (PS) method. Like many works in 3D reconstruction, we leverage neural shape representations and learnt renderers. However, our work differs from state-of-the-art multi-view PS methods such as PS-NeRF [40] or SuperNormal [1] in that we explicitly leverage per-pixel intensity renderings rather than relying mainly on estimated normals.

We model point light attenuation and explicitly raytrace cast shadows in order to best approximate each point's incoming radiance. This is used as input to a fully neural material renderer that uses minimal prior assumptions and is jointly optimised with the surface. Finally, estimated normal and segmentation maps can also be incorporated in order to maximise the surface accuracy.

Our method is among the first to outperform the classical approach of DiLiGenT-MV and achieves an average Chamfer distance of 0.2mm for objects imaged at a distance of approximately 1.5m with approximately $400\times 400$ resolution. Moreover, we show robustness to poor normals in the low light count scenario, achieving 0.27mm Chamfer distance when pixel rendering is used instead of estimated normals.

Keywords:
Multiview Photometric Stereo · Neural Rendering · Neural Point Light Rendering

1 Introduction

Photometric Stereo (PS) is a long-standing and very important problem in the field of Computer Vision. While early PS works [39, 11, 27, 13] primarily tackled the estimation of normals from a single view, their potential was really unlocked by binocular [18, 4, 38, 24] and multi-view [5, 46, 34, 28, 21, 40] setups, as these allowed for a more accurate recovery of shape and not only normals. This opened up multiple applications such as general 3D reconstruction of objects for novel-view rendering, relighting or material editing [40], robot interaction, quality control in manufacturing, or industrial conveyor belt scanning.

Figure 1: Demonstration of the fragility of relying mainly on estimated normals (using [9]) for deep learning based multi-view photometric stereo. The first column shows the ground truth of the Buddha object from the DiLiGenT-MV benchmark. The second column shows the normals estimated using only 6 out of the 96 available lights (for view 1 the mean normal error is $13.5^{\circ}$). Using such normals leads to a large error in reconstruction (0.49mm). If intensities are used instead, an error of 0.29mm is achieved. The right column shows the result using all lights; we note that we outperform DiLiGenT-MV [21], PS-NeRF [40] and MVAS [2] and only lose to SuperNormal [1], as shown in Table 2.

Along with the increasing number of views, Photometric Stereo has undergone another important change by moving from inverse graphics approaches enabled by classical non-linear optimisation (for single view [10], binocular [18], multi-view [28, 21]) to neural network approaches (e.g. [11, 26]) and, in particular, inverse graphics approaches enabled by neural representations (for single view [8], binocular [24] and multi-view [40, 1]). Despite the latter methods (especially [1]) achieving impressive accuracy on the DiLiGenT-MV [21] benchmark, their approaches are somewhat unsatisfactory. In particular, [1] does not explicitly use image intensities to optimise for shape and is fully reliant on per-view normal maps, whereas PS-NeRF [40] only uses average intensity during the surface optimisation stage and thus leaves most of the photometric information unused.

In this work we provide the first neural multi-view photometric stereo approach which fully leverages the availability of pixel intensity information for estimating 3D shape from Photometric Stereo images (see Figure 5). We achieve this by explicitly modelling the incident light from point light sources, which lets us favour intensity based shape optimisation over purely normal driven shape optimisation [40, 1]; the latter is fragile to incorrectly estimated normals, especially when very few lights are available.

We model point light attenuation, explicitly raytrace cast shadows and also optimise a fully neural material renderer. Thus we are able to achieve competitive reconstruction accuracy using only 6 lights and match the SOTA performance when all lights and normal map information are also fused.

The remainder of this paper is organised as follows. Section 2 discusses the related works in Photometric Stereo and Multi-View Stereo. It is followed by a description of our method in Section 3. The experimental setup and experiment results are described in Sections 4 and 5 respectively.

2 Related Work

There is an extensive literature on single and multi-view photometric stereo and we review the following cases:

Single view photometric stereo. The first successful deep learning based single view PS method was CNN-PS [11], which was extended by [26] and [27] to be applicable to general calibrated point light configurations. Other works like [7, 5, 14] have used material reflectance priors (using specific BRDFs like Lambertian or Ward) for single view normal prediction. Other recent approaches, like [3, 20, 19, 41], have tackled the uncalibrated setting, and more recently the idea of universal uncalibrated PS has been introduced [12, 13, 9].

Multi-view photometric stereo. Classical optimisation approaches have used triangle meshes [34] or signed distance function based parameterisations [28, 33, 47] to tackle the multi-view PS problem under diffuse reflectance. More advanced methods such as [21] are applicable to more general materials.

Neural surfaces. Recently, neural surface approaches have become very popular following the introduction of NeRF [32] and its extension to neural SDF parameterisations in [43, 42]. The first approaches to use neural SDFs specifically for the multi-view PS problem are [17, 15, 16, 45], PS-NeRF [40] and SuperNormal [1]. However, these approaches mainly rely on normal fusion and do not attempt to update the neural surface through rendering. Other recent approaches have made the rendering more structured so that it better respects the physics of light reflection, like Ref-NeRF [37], Neuralangelo [22], NERO [23] and NeILF++ [44], but none of these methods has yet been applied to the PS problem, and in particular they lack the ability to model point light illumination. Finally, it is worth mentioning [8], which introduced the idea of an infinitely differentiable surface (SIREN [36]) with Lambertian rendering for single view PS, and [25], which extended this method to the binocular setting and also added a fully learnable general material renderer.

We borrow the material renderer from [25] while extending the approach to work in the multi-view setting using an SDF parameterisation similar to [43, 42].

3 Method

This section describes our method for solving the point light multi-view photometric stereo problem. A high level overview is shown in Figure 2. Our method is primarily an inverse neural rendering method. Section 3.1 describes the assumed irradiance model, Section 3.2 describes the underlying neural surface parameterisation, Section 3.3 describes the initialisation and Section 3.4 describes the training losses used.

Figure 2: High level schematic of our overall method. Single view PS is used to obtain normal maps. Training the SDF with normal loss obtains a rough surface which is then refined with full volumetric rendering. The rendering procedure is explored in Figure 3. The second row also shows the GT and render images (as grayscale), the rendering error (with red =0.1) as well as the computed shadow map.

3.1 Irradiance equation

We now explain the assumed irradiance of a world point $\mathbf{X}$ with surface normal $\mathbf{N}$ and albedo $\rho$. We assume point light sources, indexed by $m$, at positions $\mathbf{P}_m$, which generate spatially varying lighting vectors $\mathbf{L}_m = \mathbf{P}_m - \mathbf{X}$. In addition, point light propagation results in the attenuation factor $a_m = \frac{\phi_m}{\|\mathbf{L}_m\|^2}$, where $\phi_m$ is the intrinsic brightness of the light source. We note that the literature [31] usually also assumes an angular dissipation factor, but these calibration numbers are unavailable for DiLiGenT-MV [21], therefore we opt for the simpler, perfect point light source model. Thus, the reflected intensity $i_m$ of that point (omitting the dependence on $\mathbf{X}$ for clarity) for the $m$-th light source is:

$$i_m = s_m a_m \rho\, B(\mathbf{N}, \mathbf{L}_m, \mathbf{V}) + i_{r,m} \tag{1}$$

$B$ is assumed to be a general BRDF and $s \in \{0, 1\}$ is an indicator variable accounting for cast shadows that completely block direct reflectance. $i_{r,m}$ corresponds to indirect reflectance (e.g. self reflections) and we make the simplifying assumption that it is also proportional to the surface albedo as well as the light source attenuation. Finally, we note that self reflections occur on concave parts of an object's surface, which also correlate with more cast shadows, therefore we make the assumption $i_r \propto a \rho \sum (1 - s)$. Indeed, the term ambient occlusion is usually used to describe the total obliqueness of a surface point and this is approximated with the number of shadows. To simplify the learning problem, we assume that $i_r$ is constant between different lights, thus $i_{r,m} = a \rho \sum (1 - s) R(\mathbf{X})$. This assumption is further justified in the supplementary by empirical measurements on Blender-rendered data.
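As a rough illustration of Equation 1, the following sketch evaluates the reflected intensity of a single point under a single point light. It is not the renderer used in this work: a Lambertian term stands in for the general BRDF $B$, and the function name `point_light_intensity` and all numeric values are hypothetical.

```python
import numpy as np

def point_light_intensity(X, N, P, phi, rho, shadow=1.0, indirect=0.0):
    """Reflected intensity of Equation 1 for one point light.

    Assumption: a Lambertian term max(N.L, 0) stands in for the general BRDF B.
    """
    L = P - X                                # un-normalised lighting vector L_m = P_m - X
    dist2 = float(np.dot(L, L))              # ||L_m||^2
    a = phi / dist2                          # point light attenuation a_m
    L_hat = L / np.sqrt(dist2)               # unit lighting direction
    B = max(float(np.dot(N, L_hat)), 0.0)    # Lambertian stand-in for B(N, L, V)
    return shadow * a * rho * B + indirect

X = np.array([0.0, 0.0, 0.0])
N = np.array([0.0, 0.0, 1.0])
near = point_light_intensity(X, N, np.array([0.0, 0.0, 1.0]), phi=1.0, rho=0.5)
far = point_light_intensity(X, N, np.array([0.0, 0.0, 2.0]), phi=1.0, rho=0.5)
```

Doubling the distance to the light reduces the intensity by a factor of four, which is the inverse square attenuation described above.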

3.2 Neural SDF

Figure 3: Visualisation of our volume rendering approach. Two rays with multiple ray samples $\mathbf{X}_{ri}$ and $\mathbf{Y}_{ri}$ are shown. The average intersection points $\mathbf{X}_A$ and $\mathbf{Y}_B$ are also shown, as these are used to ray trace cast shadows (towards the light source at position $\mathbf{P}$ with brightness $\phi$). Cast shadow samples are marked as $\mathbf{X}_{si}$ and $\mathbf{Y}_{si}$ respectively. Note that points that significantly contribute to the total rendering (through the accumulated opacity) are coloured blue and points that do not (because they are outside of the surface or occluded) are marked red. For these red points, the free space regulariser is used. For shadow sample points rendering is not performed, so they are marked black. Note that the average intersection points are only used to guide shadows, so they are not rendered either. Finally, for the $\mathbf{X}_{r2}$ ray sample point, the normal $\mathbf{N}$, lighting $\mathbf{L}$ and viewing $\mathbf{V}$ vectors (that are used for rendering) are shown in red, green and blue respectively.

Geometry parameterisation. Following the work of other neural volumetric approaches [43, 42], we parameterise the scene geometry as the zeroth level set of an implicit function $F$ corresponding to the Signed Distance Field, $d = F(\mathbf{X})$. The unknown function $F$ is a deep neural network and the objective is to optimise its weights. We note that the SDF of any arbitrary geometry is always continuous, differentiable almost everywhere and satisfies the Eikonal equation of unit magnitude gradient, $\|\nabla F(\mathbf{X})\| = 1$. Finally, we note that for surface points (where $F(\mathbf{X}) = 0$), the surface normal is the gradient of the SDF, i.e. $\mathbf{N}(\mathbf{X}) = \nabla F$. This gives a direct relationship between the SDF and the rendering, and thus allows the SDF to be trained through a rendering loss on the initial photometric stereo images. In addition, if surface normal maps are available (e.g. from single view estimation networks) they can also be used as an additional training signal. We use the SIREN architecture [36], which is an MLP with sinusoidal activation functions; this guarantees that the surface is infinitely differentiable and can thus be easily recovered from its derivatives. Indeed, [8, 25, 44] have used SIREN networks for single view and binocular PS as well as multi-view stereo, so our method is a direct extension of them specifically for MV-PS under point light sources.
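The SDF properties used above can be checked numerically on a simple shape. The sketch below uses an analytic sphere as a stand-in for the network $F$ and central finite differences instead of automatic differentiation; both choices, as well as the names `sphere_sdf` and `sdf_gradient`, are assumptions made for illustration only.

```python
import numpy as np

def sphere_sdf(X, radius=1.0):
    """Analytic SDF of a sphere centred at the origin (stand-in for the network F)."""
    return np.linalg.norm(X, axis=-1) - radius

def sdf_gradient(F, X, eps=1e-5):
    """Central finite-difference gradient of F at points X of shape (n, 3)."""
    grad = np.zeros_like(X)
    for k in range(3):
        step = np.zeros(3)
        step[k] = eps
        grad[:, k] = (F(X + step) - F(X - step)) / (2.0 * eps)
    return grad

X = np.array([[0.0, 0.0, 1.0], [2.0, 0.0, 0.0], [0.3, 0.4, 0.0]])
d = sphere_sdf(X)                         # 0 on the surface, > 0 outside, < 0 inside
G = sdf_gradient(sphere_sdf, X)
grad_norm = np.linalg.norm(G, axis=-1)    # Eikonal equation: close to 1 everywhere
N_surface = G[0]                          # surface normal at the surface point (0, 0, 1)
```

For the surface point the gradient coincides with the outward normal, and the gradient magnitude is one at all three query points, as required by the Eikonal equation.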

Ray sampling. We follow a volumetric sampling and rendering method similar to VolSDF [42], where the neural SDF is queried at multiple samples on outgoing rays from each image foreground pixel. For each pixel an estimate of the depth is available (initialised from single view PS and occasionally updated during training) and thus most of the samples are concentrated around that depth. However, to allow the surface to evolve and to minimise free space artefacts, additional samples are also taken in a bigger depth range. At each point, the Laplace density function (see VolSDF [42]) is used to convert from SDF values to transparency, and alpha blending is used to accumulate depth, normals and rendered intensity along each ray using the standard approach.

Thus, the transparency is $\alpha = \exp(-t\,\delta r)$, with $\delta r$ being the distance along the ray, and the intersection $\mathbf{X}_A$ is computed as a weighted sum of ray samples, $\mathbf{p} = \sum_i w_i \mathbf{r}_i$, with the sample weights $w_i$ corresponding to the accumulated opacity. We note that this average point $\mathbf{X}_A$ is further used to compute cast shadows but it is not directly rendered. Similar averaging is used for the rendered intensity as well as the surface normal, which are used to compute losses. The ray sampling process is shown in Figure 3.
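The accumulation along a ray can be sketched as follows. This is a minimal example under stated assumptions rather than the implementation used in the experiments: the density takes the VolSDF-style Laplace CDF form, the per-sample opacity is taken as $1 - \exp(-t\,\delta r)$ (the usual alpha compositing convention), and `beta`, the sample spacing and the planar toy surface are hypothetical choices.

```python
import numpy as np

def laplace_density(d, beta=0.01):
    """Density from signed distance d (assumed form: Laplace CDF of -d, scaled by 1/beta)."""
    d = np.asarray(d, dtype=float)
    cdf = np.where(d > 0, 0.5 * np.exp(-d / beta), 1.0 - 0.5 * np.exp(d / beta))
    return cdf / beta

def render_weights(sigma, delta):
    """Alpha compositing weights w_i = T_i * opacity_i along one ray."""
    opacity = 1.0 - np.exp(-sigma * delta)                          # per-sample opacity
    T = np.concatenate([[1.0], np.cumprod(1.0 - opacity)[:-1]])     # transmittance before sample i
    return T * opacity

# Toy ray hitting a plane at depth 1.0, so the SDF along the ray is 1 - depth.
depth = np.linspace(0.5, 1.5, 201)
delta = np.full_like(depth, depth[1] - depth[0])
w = render_weights(laplace_density(1.0 - depth), delta)
mean_depth = float(np.sum(w * depth) / np.sum(w))   # depth of the average intersection X_A
```

The weights sum to approximately one for a ray that hits the surface, and the weighted mean depth lands close to the true intersection; the same weights can be reused to average normals and rendered intensities.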

Learned BRDF. We follow the approach of [25], where the BRDF is parameterised as another SIREN network and thus completely learned from the data. This assumes that the material properties are uniform across the target scene except for a scalar albedo variation. We emphasise that we chose to perform grayscale intensity rendering instead of full RGB as this is expected to minimise the synthetic to real gap. Indeed, real RGB images are usually acquired by demosaicing single intensity values and this procedure is usually optimised to best recover intensity, not colour (e.g. see [6] used by OpenCV).

To minimise over-fitting, the material BRDF network receives as input only the relative angles between $\mathbf{N}$, $\mathbf{L}$ and $\mathbf{V}$. In addition, to simplify the learning problem, we follow the principles described in the MERL real material database [30]. For that, the half vector $\mathbf{H} = \frac{\mathbf{L} + \mathbf{V}}{|\mathbf{L} + \mathbf{V}|}$ is first computed and the input to the network is the relative angles between $\mathbf{N}$, $\mathbf{L}$ and $\mathbf{H}$. Finally, we note that the final activation of the SIREN part of the BRDF network is exponential and there is a post multiplication with an $\mathbf{N} \cdot \mathbf{L}$ factor, so that the network learns a multiplicative factor over the diffuse reflectance. We thus parameterise:

$$B(\mathbf{N}, \mathbf{L}_m, \mathbf{V}) = (\mathbf{N} \cdot \mathbf{L}_m)\, \exp\big(\text{SIREN}(\mathbf{N} \cdot \mathbf{H}_m, \mathbf{N} \cdot \mathbf{L}_m, \mathbf{H}_m \cdot \mathbf{L}_m)\big) \tag{2}$$
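The structure of Equation 2 can be illustrated with a placeholder in place of the learned network. In the sketch below, `log_factor` is a hypothetical stand-in for the SIREN network, and all vectors are assumed to be normalised before the dot products are taken.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def brdf(N, L, V, log_factor=lambda nh, nl, hl: 0.0):
    """Equation 2 with a placeholder for the SIREN network.

    `log_factor` is a hypothetical stand-in for the learned network; the
    default returns 0, so exp(0) = 1 and B reduces to the diffuse term N.L.
    """
    N, L, V = unit(N), unit(L), unit(V)
    H = unit(L + V)                                      # half vector
    nh, nl, hl = np.dot(N, H), np.dot(N, L), np.dot(H, L)
    return float(nl * np.exp(log_factor(nh, nl, hl)))

N = np.array([0.0, 0.0, 1.0])
L = np.array([1.0, 0.0, 1.0])
V = np.array([-1.0, 0.0, 1.0])
diffuse = brdf(N, L, V)                                              # N.L = cos(45 degrees)
glossy = brdf(N, L, V, lambda nh, nl, hl: 2.0 * (nh - 1.0) + 0.5)    # toy boost, largest at N.H = 1
```

Because the network output passes through an exponential, the learned factor is always positive and a zero output recovers purely diffuse reflectance.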

Albedo. The scalar albedo $\rho$ is learned with another SIREN network which is queried for every sample point.

Shadow estimation. Handling cast shadows is imperative for any successful photometric stereo method. To estimate cast shadows for a point, we raytrace from that point to the light source following the direction of the lighting vector $\mathbf{L}_m$ computed above. For each ray we take 16 random samples $j \in [2\text{mm}, 50\text{mm}]$. For all these points, we query the SDF network and accumulate opacity following the same volumetric rendering procedure, i.e. $s = \prod_j (1 - \alpha_j)$. We note that this shadow computation procedure is very computationally expensive, therefore it cannot be computed for all the rendered points. Instead, we only compute it for the average intersection point of each ray ($\mathbf{X}_A$ in Figure 3) and assume it is constant for all the other ray samples.
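A possible form of this shadow ray march is sketched below. Several details are assumptions not stated in the text: the 16 random samples are drawn with stratified jitter, $\alpha_j$ is treated as a per-sample opacity, and a hard step density replaces the Laplace density to keep the toy scene simple. The names `shadow_factor`, `occluder` and `density` are hypothetical.

```python
import numpy as np

def shadow_factor(X_A, P, sdf, density, n_samples=16, near=2.0, far=50.0, rng=None):
    """Soft shadow indicator s for point X_A lit from light position P (units: mm).

    s close to 1 means the point is lit, s close to 0 means it is in cast shadow.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    direction = (P - X_A) / np.linalg.norm(P - X_A)
    edges = np.linspace(near, far, n_samples + 1)
    t = edges[:-1] + rng.uniform(size=n_samples) * np.diff(edges)   # one random sample per bin
    pts = X_A + t[:, None] * direction
    delta = np.diff(t, append=far)                                  # spacing to the next sample
    alpha = 1.0 - np.exp(-density(sdf(pts)) * delta)                # per-sample opacity (assumed)
    return float(np.prod(1.0 - alpha))                              # s = prod_j (1 - alpha_j)

# Toy scene: a sphere of radius 5 mm centred 20 mm above the origin acts as an occluder.
occluder = lambda X: np.linalg.norm(X - np.array([0.0, 0.0, 20.0]), axis=-1) - 5.0
density = lambda d: np.where(d < 0.0, 10.0, 0.0)    # hard stand-in for the Laplace density

X_A = np.zeros(3)
s_blocked = shadow_factor(X_A, np.array([0.0, 0.0, 1000.0]), occluder, density)
s_lit = shadow_factor(X_A, np.array([1000.0, 0.0, 0.0]), occluder, density)
```

The light directly above the point is blocked by the occluder, so `s_blocked` is close to zero, while the light to the side has a free path and `s_lit` equals one.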

Indirect reflectance. Finally, the indirect reflectance $i_{r,m}$ (Equation 1) is approximated as $i_{r,m} = a \rho \tilde{s} R(\mathbf{X})$. $\tilde{s}$ is computed as the mean value of one minus the shadows (over the number of lights) and is a measure of the local surface concavity: $\tilde{s} = 0$ corresponds to no shadows and hence completely convex local geometry and therefore no interreflection, while $\tilde{s} = 1$ corresponds to all lights being shaded and hence maximum concavity, for which we allow maximum learned interreflection. Finally, $R(\mathbf{X})$ is queried from another SIREN network and we note that it is parameterised to be constant over all lights.
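The indirect term can be written in a few lines. In this sketch the value of $R(\mathbf{X})$, the attenuation values and the shadow patterns are hypothetical numbers chosen for illustration, and the function name `indirect_reflectance` is not from the original work.

```python
import numpy as np

def indirect_reflectance(a, rho, shadows, R):
    """i_{r,m} = a_m * rho * s_tilde * R for every light m.

    shadows: s_m in [0, 1] per light (1 = lit, 0 = fully shadowed).
    R: light-independent inter-reflection strength at this point (learned in the paper).
    """
    s_tilde = float(np.mean(1.0 - np.asarray(shadows, dtype=float)))   # fraction of shadowed lights
    return np.asarray(a) * rho * s_tilde * R, s_tilde

a = np.array([1.0, 0.5, 0.25, 0.25])    # hypothetical attenuation per light
i_convex, st_convex = indirect_reflectance(a, rho=0.8, shadows=[1, 1, 1, 1], R=0.2)
i_concave, st_concave = indirect_reflectance(a, rho=0.8, shadows=[1, 0, 0, 1], R=0.2)
```

A point that is lit by every light receives no indirect contribution, whereas a point shadowed for half of the lights receives half of the maximum learned interreflection.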

Some visualisations of the generated renderings for synthetic and real experiments are shown in Figure 4.

3.3 Initialisation

As is standard in competing approaches, e.g. [40], we can use single view PS (at each view) in order to get normal and depth estimates. We start by computing per view normal maps using the state-of-the-art PS normal estimation network [9] and also use numerical integration [35] in order to get approximate depth maps. The depth maps do not need to be accurate as they are only used to initialise the ray search spaces. The normal maps can be used in order to provide an additional training signal.

The SDF network is initialised with weights that approximate the SDF of a perfect sphere. To speed up convergence, we always run 3 epochs with normal and silhouette loss without rendering (which is a lot more computationally expensive).

Final surface calculation. After the optimisation is completed, the SDF network can be sampled on a regular grid of points and a triangle mesh surface can be recovered using the standard marching cubes [29] algorithm. For these recovered vertices, the albedo network can be queried in order to obtain a textured reconstruction.

3.4 Losses

We use the following losses:

Rendering loss. We use L1 loss on the rendered intensities, i.e. $L_{rend} = |i_{rend} - i_{gt}|$. We note that to better balance the rendering data, each light source is scaled so that the maximum GT intensity is 1 (saturated). This is performed because some of the DiLiGenT-MV lights are very dark. The relative weighting of this loss when used is 1e3.
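The effect of the per-light scaling can be seen in a small example. The sketch assumes that the same scale factor is applied to the rendered and ground truth intensities of each light and that the loss is averaged over lights and pixels; the function name and the array layout are hypothetical.

```python
import numpy as np

def rendering_loss(i_rend, i_gt, eps=1e-8):
    """Mean L1 loss after scaling each light so that its maximum GT intensity is 1.

    i_rend, i_gt: arrays of shape (n_lights, n_pixels).
    Assumption: the same per-light scale is applied to the rendered values.
    """
    scale = 1.0 / (i_gt.max(axis=1, keepdims=True) + eps)
    return float(np.mean(np.abs(i_rend * scale - i_gt * scale)))

i_gt = np.array([[0.2, 0.4], [0.01, 0.02]])      # the second light is much darker
i_rend = np.array([[0.2, 0.2], [0.01, 0.01]])
loss = rendering_loss(i_rend, i_gt)
```

Without the scaling, the error under the dark light would be twenty times smaller than under the bright one; after scaling both lights contribute equally.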

Silhouette loss. As is standard in volumetric SDF approaches, we apply binary cross entropy loss between the predicted silhouettes and the ground truth ones. We generate predicted silhouettes with the total accumulated opacity $\alpha_{total} = \prod_j (1 - \alpha_j)$ (which should be 1 for foreground and 0 for background points), $L_{sil} = \text{BCE}(\alpha_{total}, mask)$. We consider a 5 pixel band outside the provided segmentation masks for background points. We note that we ignore the outermost pixels of the provided segmentation masks and also the innermost pixels of the background (leaving 2 pixel boundaries between foreground and background). This is done for better numerical stability as the exact discretisation of the boundaries is not very reliable. The relative weighting of this loss is 1e1.

Normal loss. We apply an angular normal loss to match the SDF computed normals $\mathbf{N}_s$ to the network predicted normals $\mathbf{N}_n$. We follow [25, 26] and use the angular error (as opposed to just L2 used in approaches like [40]), $L_n = |\text{atan2}(\|\mathbf{N}_n \times \mathbf{N}_s\|, \mathbf{N}_n \cdot \mathbf{N}_s)|$. For rays where the accumulated opacity is less than 0.01 (i.e. the ray does not intersect any surface), no normal loss is applied. In addition, following previous works, the normal loss is weighted by the obliqueness of each point $(\mathbf{N} \cdot \mathbf{V})$, which stops the optimisation from trying to fit occlusion boundaries, which are numerically unstable. The relative weighting of this loss when used is 1.
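A compact version of this loss is sketched below. The text does not state which normal enters the $(\mathbf{N} \cdot \mathbf{V})$ weight or how the weighted errors are normalised; the sketch assumes the predicted normal $\mathbf{N}_n$, a unit viewing vector pointing towards the camera, and normalisation by the sum of weights.

```python
import numpy as np

def angular_normal_loss(N_n, N_s, V, opacity, min_opacity=0.01):
    """Weighted mean angular error (radians) between predicted and SDF normals.

    Rays with accumulated opacity below min_opacity are skipped and the
    remaining rays are weighted by N_n . V (assumed form of the obliqueness weight).
    """
    cross = np.linalg.norm(np.cross(N_n, N_s), axis=-1)
    dot = np.sum(N_n * N_s, axis=-1)
    angle = np.abs(np.arctan2(cross, dot))
    weight = np.clip(np.sum(N_n * V, axis=-1), 0.0, None) * (opacity >= min_opacity)
    return float(np.sum(weight * angle) / max(np.sum(weight), 1e-8))

N_n = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])    # network predicted normals
N_s = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])    # SDF normals, second is off by 90 degrees
V = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
loss_all = angular_normal_loss(N_n, N_s, V, np.array([1.0, 1.0]))
loss_masked = angular_normal_loss(N_n, N_s, V, np.array([1.0, 0.001]))
```

Using `atan2` of the cross and dot products keeps the angle well conditioned near 0 and 180 degrees, where an `arccos` of the dot product would have unbounded gradients.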

Eikonal loss. To enforce the Eikonal equation for all ray samples, we apply L1 loss, which is standard in most SDF approaches, i.e. $L_{eik} = \big|\, \|\nabla d\| - 1 \,\big|$. The relative weighting of this loss is 1e1.

Free space regulariser. It is important to use some sort of regulariser to minimise free space artifacts on the inside and outside of the object, in unseen regions. To do that, along each ray, the SDF needs to be encouraged to become more negative at the points behind the first ray/surface intersection, with the penalty being the difference between the soft sign, i.e. $\text{softsign}(d) = \frac{d}{1 + |d|}$, and the target sign $s_t$ (-1 for inside, +1 for outside). So we compute $L_{reg} = |\text{softsign}(d) - s_t|$. This is shown graphically in Figure 3, where $\mathbf{X}_{r1}$ is in front of $\mathbf{X}_A$ and thus has $s_t = 1$, while $\mathbf{X}_{r6}$ is behind it and thus has $s_t = -1$. Finally, to minimise the effect of the regulariser on the surface evolution, we only apply it to points whose opacity contribution is less than 1% (i.e. far away from $\mathbf{X}_A$, coloured red in Figure 3). The relative weighting of this loss is 1e-1.
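The Eikonal and free space terms, together with the relative weights quoted in this section, can be combined as in the following sketch. Averaging over samples and forming the total as a plain weighted sum are assumptions; the sample values are hypothetical.

```python
import numpy as np

# Relative loss weights as stated in Section 3.4.
LOSS_WEIGHTS = {"rend": 1e3, "sil": 1e1, "normal": 1.0, "eik": 1e1, "reg": 1e-1}

def softsign(d):
    return d / (1.0 + np.abs(d))

def eikonal_loss(grad_d):
    """L_eik = | ||grad d|| - 1 |, averaged over ray samples."""
    return float(np.mean(np.abs(np.linalg.norm(grad_d, axis=-1) - 1.0)))

def free_space_regulariser(d, target_sign, contribution, max_contribution=0.01):
    """L_reg = |softsign(d) - s_t|, averaged over samples contributing < 1% opacity."""
    mask = contribution < max_contribution
    if not np.any(mask):
        return 0.0
    return float(np.mean(np.abs(softsign(d[mask]) - target_sign[mask])))

def total_loss(losses):
    """Weighted sum of the individual loss terms (assumed combination rule)."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())

d = np.array([3.0, 0.0, -1.0, 1.0])           # SDF at four ray samples
s_t = np.array([1.0, -1.0, -1.0, -1.0])       # +1 in front of the surface, -1 behind it
w = np.array([0.0, 0.9, 0.001, 0.002])        # opacity contribution of each sample
L_reg = free_space_regulariser(d, s_t, w)     # the second sample is skipped (near the surface)
L_eik = eikonal_loss(np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]]))
total = total_loss({"eik": L_eik, "reg": L_reg})
```

The last sample lies behind the surface but still has a positive SDF value, so it receives the largest penalty; this is the kind of free space artefact the regulariser is meant to remove.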

4 Experiment Setup

This section explains the datasets used for evaluation of our method along with the evaluation metric and competitors.

4.1 Datasets

DiLiGenT-MV. Our main evaluation target is DiLiGenT-MV [21], containing 5 objects with 96 lights in 20 views. Images are of $612 \times 512$ resolution, with objects actually occupying a maximum of $400 \times 400$ px. GT meshes, camera intrinsics and extrinsics and normal maps are provided, as well as point light positions and far-field light brightness ($\phi$). We note that these brightness values were measured from the intensity of a flat calibration target roughly positioned at the object, so the intrinsic brightness is recovered by multiplying with the inverse distance squared. We note that as these calibration data are unavailable, $\phi$ is expected to be fairly inaccurate and thus is also optimised during training.

Dataset anomalies. In order to have the fairest assessment, we report the following dataset anomalies on DiLiGenT-MV and our attempts to overcome them. Firstly, the provided GT normal maps are incompatible with the GT meshes as well as the provided intrinsics and extrinsics. The best compatibility was achieved by using the intrinsic matrix of Reading for all objects while simultaneously performing SVD to fix the non-unit determinant of the provided rotation matrices (especially for Reading). Despite that, it is still impossible to get a perfect match between the provided normals and meshes, so the exact parameters remain uncertain. For all experiments except the first line of Table 2, we use the modified parameters (Reading intrinsics and enforcing unit determinant on the rotation matrices R with SVD).
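The text does not detail how SVD is applied to the rotation matrices. One common choice, assumed in the sketch below, is to project each matrix onto the nearest proper rotation (the orthogonal Procrustes solution); the perturbed matrix used as input is a hypothetical example.

```python
import numpy as np

def nearest_rotation(R):
    """Project a 3x3 matrix onto the closest rotation matrix (orthogonal, det = +1)."""
    U, _, Vt = np.linalg.svd(R)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])    # flip the last axis if a reflection
    return U @ D @ Vt

c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
# Hypothetical corrupted input: slightly scaled and perturbed rotation.
R_noisy = 1.02 * R_true + 1e-3 * np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
R_fixed = nearest_rotation(R_noisy)
```

The result has unit determinant and orthonormal columns while staying close to the input, so camera poses change as little as possible.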

In addition, the provided segmentation masks in Bear and Cow contained holes that had to be closed manually otherwise the silhouette loss would make big holes on the reconstructed meshes.

Finally, Ikehata [11] first noticed that the first 20 images of the Bear appear to be corrupted. We also found more similarly corrupted images in other views (more visualisations in the supplementary) and made our best effort to manually mark and ignore them, but it is possible that more image corruptions remain unnoticed.

Synthetic DiLiGenT-MV. To better demonstrate the effectiveness of our method without the real data corruptions discussed above, we rendered a synthetic version of DiLiGenT-MV with Blender (see Figure 4). We use the exact same objects, with the exact same poses, and rendered the 96 point lights. The object materials were chosen to loosely mimic real objects and the albedo was set to a random pattern. Finally, we note that this synthetic data can be used to visualise shadow and indirect reflection maps, which are really hard to correctly evaluate on real data. This is explored in the supplementary.

4.2 Hyperparameters and Training

We use a TensorFlow port of SIREN for all the experiments. The SDF network is set to 5x512 layers (1.05M parameters) and the albedo and self reflection networks to 3x256 (133K parameters). The BRDF network is set to 3x32 layers (2.5K parameters). We use 64 ray samples in a 100mm ray range, an additional 64 around the average intersection (in a shrinking distance range up to 10mm) and 64 extra samples computed with one step of Newton's method (for approximating the zero of the SDF). For each shadow ray we use 16 samples. We train with a batch size of 512 rays for 100 epochs, which takes approx. 20h on an NVIDIA TITAN RTX and 17GB of RAM when rendering all 96 lights. Note that the 6 lights version completes in only 5h.

4.3 Evaluation protocol

We evaluate our method by computing the Chamfer distance (marked as surface error, SE) between the reconstructions and the ground truth. This is computed as the average of the Hausdorff distance from the reconstruction to the ground truth and the opposite, with the distances computed with Meshlab. We note that in order to have a fair comparison and not bias the error with the unseen bottom of the objects, the bottom 6mm of the GTs and all reconstructions are removed. All DiLiGenT-MV objects are aligned to be touching the XY plane, so the cropping is straightforward. This cropping also avoids large errors at some parts of the bottom of the objects that are occluded by the background (e.g. the feet of Reading, see supplementary).
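The protocol above can be approximated on point sets as in the following minimal sketch. This is not the Meshlab pipeline used for the reported numbers: it assumes both surfaces have already been sampled into point clouds (in mm, with the object resting on the XY plane), uses brute-force nearest neighbours, and all function names are hypothetical.

```python
import numpy as np


def crop_bottom(points, z_min=6.0):
    """Discard points in the bottom z_min millimetres (object assumed to rest on z = 0)."""
    return points[points[:, 2] >= z_min]


def one_sided_mean_distance(src, dst):
    """Mean distance from each point in src to its nearest neighbour in dst (brute force)."""
    diff = src[:, None, :] - dst[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    return dists.min(axis=1).mean()


def chamfer_distance(rec, gt, z_min=6.0):
    """Symmetric surface error: average of the two one-sided mean distances after cropping."""
    rec = crop_bottom(rec, z_min)
    gt = crop_bottom(gt, z_min)
    return 0.5 * (one_sided_mean_distance(rec, gt) + one_sided_mean_distance(gt, rec))


# Toy example: one point of each cloud lies in the cropped bottom region and is ignored.
rec = np.array([[0.0, 0.0, 10.0], [0.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 11.0], [5.0, 5.0, 1.0]])
print("surface error (mm):", chamfer_distance(rec, gt))
```

The brute-force distance matrix is only practical for small point sets; a spatial index would be needed for densely sampled meshes.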

Competing approaches. We compare against DiLiGenT-MV [21], PS-NeRF [40], MVAS [2] and SuperNormal [1]. For all of the methods, the output reconstructions were downloaded from the SuperNormal project page (https://xucao-42.github.io/homepage/). Synthetic data is only used to ablate our own method, as the DiLiGenT-MV and SuperNormal code is not available (and our method using only the normal loss is very similar to PS-NeRF and SuperNormal). We also note that PS-NeRF and MVAS use a modified/pre-processed version of DiLiGenT-MV, which potentially alleviates some of the anomalies (especially the segmentation mask holes).

5 Experiments

Figure 4: Qualitative visualisation of re-rendering and rendering errors for synthetic and real data. We note that the scaling of the error map sets red to 0.1.

We describe two sets of experiments, on synthetic data and on real data, in Sections 5.1 and 5.2 respectively.

5.1 Synthetic data

As mentioned in Section 4.1, the original DiLiGenT-MV dataset contains several anomalies, making it hard to correctly ablate the individual steps of our method. Thus, the ablation experiments are performed on the Synthetic-DiLiGenT-MV dataset described in Section 4.1.

| Method | Bear | Buddha | Cow | Pot2 | Reading | Average SE |
|---|---|---|---|---|---|---|
| GT Normals | 0.06 | 0.08 | 0.03 | 0.04 | 0.02 | 0.05 |
| Ours - [normals only] | 0.17 | 0.14 | 0.10 | 0.14 | 0.16 | 0.14 |
| Ours - [intensities only] | 0.13 | 0.29 | 0.07 | 0.20 | 0.11 | 0.16 |
| Ours - [normals + intensities] | 0.11 | 0.17 | 0.04 | 0.18 | 0.11 | 0.12 |
Table 1: Ablation study on synthetic DiLiGenT-MV data (surface error in mm). Using both normals and intensities for shape estimation achieves the lowest average error. Note in particular that it outperforms both other configurations on the Bear and Cow objects, indicating that the combined approach is better than a simple interpolation between the two.

We first show that our network can achieve a very low error of 0.05mm when using the ground truth normals, which is not the case for the real DiLiGenT-MV dataset as shown in Table 2. We also show that if predicted normals are used the performance is worse but still quite good (i.e. 0.14mm). Using intensities only achieves an overall accuracy of 0.16mm. In addition, using intensities only significantly outperforms the normals-only network on Bear, Cow and Reading, but it is inferior on Buddha and Pot2. The reason for this differing performance is the distribution of self reflection, which affects the rendering performance more than the normal estimation network. The combined normals and intensities experiment achieves the best accuracy, with 0.12mm average error. In addition, on the Bear and Cow objects, the accuracy using both losses is better than that of either individual one (0.17, 0.13 vs 0.11 for Bear and 0.10, 0.07 vs 0.04 for Cow), showing the ability to combine the best of both training signals.
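The comparisons in the paragraph above can be checked directly from the per-object values in Table 1. The following short sketch recomputes the averages from the (rounded) table entries and lists the objects on which the combined configuration is strictly better than both single-loss configurations; the variable names are our own.

```python
# Per-object surface errors (mm) copied from Table 1.
objects = ["Bear", "Buddha", "Cow", "Pot2", "Reading"]
table1 = {
    "normals only":          [0.17, 0.14, 0.10, 0.14, 0.16],
    "intensities only":      [0.13, 0.29, 0.07, 0.20, 0.11],
    "normals + intensities": [0.11, 0.17, 0.04, 0.18, 0.11],
}

# Average error of each configuration over the five objects.
averages = {name: sum(vals) / len(vals) for name, vals in table1.items()}

# Objects where the combined losses beat BOTH single-loss configurations.
combined_wins = [
    obj
    for obj, n, i, c in zip(
        objects,
        table1["normals only"],
        table1["intensities only"],
        table1["normals + intensities"],
    )
    if c < min(n, i)
]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f} mm")
print("combined strictly better on:", combined_wins)
```

On Reading the combined configuration ties with intensities only (0.11mm in both cases), so it is not counted as a strict improvement.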

5.2 Real data

In this section we report our results on the DiLiGenT-MV benchmark. We first show that there are some issues with the DiLiGenT-MV intrinsic camera parameters and/or ground truth normals. In particular, when we compute the shape estimation error using the original intrinsic parameters, we obtain a significantly higher error than on the synthetic DiLiGenT-MV benchmark. However, if we take the intrinsic parameters from the Reading object and use them for all objects, the average error is drastically reduced (from 0.11 to 0.03). We note that for the Reading object, the change from 0.06 to 0.05 is explained by adjusting the rotation matrices with SVD.

Looking at our network's performance, we achieve state-of-the-art results that are comparable to the classical DiLiGenT-MV method [21] and significantly outperform the deep-learning-based competitors (i.e. PS-NeRF [40] and MVAS [2]). Note that in the original PS-NeRF paper the performance was compared by including the bottom (invisible) part of the object, which gave the misleading result of the deep learning method outperforming the classical one. We believe that the classical method performs so well because the objects have relatively simple geometry (especially the Cow, which is mostly convex and has the least amount of shadows and self reflection). The reconstructed meshes are shown in Figure 5.

Figure 5: Qualitative results on the real DiLiGenT-MV benchmark. The error colour scale is set so that red corresponds to 1mm. We note that we achieve consistent, uniform accuracy on all regions of all objects, including the concavity in the middle of Reading.

We also show that if we reduce the number of available lights (e.g. to 6), the quality of the estimated normals decreases and the impact of rendering increases (0.27mm average error with intensities only vs 0.49mm with normals only), justifying our key contribution. If a more challenging multi-view photometric stereo benchmark were available, we expect that our approach would show a larger gain.

| Method | Bear | Buddha | Cow | Pot2 | Reading | Average SE |
|---|---|---|---|---|---|---|
| GT normals | 0.13 | 0.15 | 0.10 | 0.12 | 0.06 | 0.11 |
| GT normals (corr. params) | 0.02 | 0.06 | 0.01 | 0.02 | 0.05 | 0.03 |
| DiLiGenT-MV [21] | 0.22 | 0.33 | 0.08 | 0.21 | 0.25 | 0.22 |
| PS-NeRF [40] | 0.27 | 0.33 | 0.27 | 0.26 | 0.36 | 0.30 |
| MVAS [2] | 0.25 | 0.37 | 0.21 | 0.20 | 0.52 | 0.31 |
| SuperNormal [1] | 0.21 | 0.21 | 0.19 | 0.15 | 0.21 | 0.19 |
| Ours - [normals only] - 6 lights | 0.59 | 0.46 | 0.34 | 0.63 | 0.45 | 0.49 |
| Ours - [intensities only] - 6 lights | 0.22 | 0.29 | 0.22 | 0.32 | 0.32 | 0.27 |
| Ours - [normals + intensities] - 6 lights | 0.25 | 0.38 | 0.24 | 0.52 | 0.33 | 0.34 |
| Ours - [normals only] | 0.29 | 0.19 | 0.17 | 0.19 | 0.29 | 0.22 |
| Ours - [intensities only] | 0.25 | 0.24 | 0.18 | 0.34 | 0.26 | 0.26 |
| Ours - [normals + intensities] | 0.21 | 0.19 | 0.17 | 0.20 | 0.22 | 0.20 |
Table 2: Results on DiLiGenT-MV [21] benchmark. For all objects we report the mean shape error as well as average shape error and average median shape error on all objects. Normals are computed using the universal PS method of [9].

Finally, we achieve slightly inferior performance to SuperNormal [1], mainly because of the Pot2 object (0.20 vs 0.15mm). However, given the discovered issues with the DiLiGenT-MV benchmark, and since the code (or the normal maps used as input) is not available for this method, we refrain from drawing conclusions about why it slightly outperforms our method. One possible explanation is that there are more locally corrupted images (like the specular highlights on Bear) that interfere with our renderer but do not pose an issue to their particular normal estimation network. Note that the SuperNormal method is functionally very similar to ours when no intensities are used (i.e. SuperNormal makes use of an SDF trained with normals, silhouette and Eikonal losses).

6 Conclusions

In this work we proposed a novel multi-view photometric stereo method. Unlike most MVPS methods, our approach explicitly leverages per-pixel intensity renderings rather than relying mainly on estimated normals. We believe such an approach is required for truly applicable and robust MVPS, as the estimated normals are likely to fail on complex materials or geometries. We demonstrate the benefit of leveraging intensities on both the synthetic and the real DiLiGenT-MV benchmarks, as well as the applicability of our method in the minimal 6-light case.

References

  • [1] Cao, X., Taketomi, T.: Supernormal: Neural surface reconstruction via multi-view normal integration. CVPR (2024)
  • [2] Cao, X., Santo, H., Okura, F., Matsushita, Y.: Multi-view azimuth stereo via tangent space consistency. In: CVPR (2023)
  • [3] Chen, G., Han, K., Shi, B., Matsushita, Y., Wong, K.Y.K.: Sdps-net: Self-calibrating deep photometric stereo networks. In: CVPR (2019)
  • [4] Du, H., Goldman, D.B., Seitz, S.M.: Binocular photometric stereo. In: BMVC (2011)
  • [5] Esteban, C.H., Vogiatzis, G., Cipolla, R.: Multiview photometric stereo. PAMI (2008)
  • [6] Getreuer, P.: Malvar-he-cutler linear image demosaicking. Image Processing on Line 1, 83–89 (2011)
  • [7] Goldman, D.B., Curless, B., Hertzmann, A., Seitz, S.M.: Shape and spatially-varying brdfs from photometric stereo. PAMI (2010)
  • [8] Guo, H., Santo, H., Shi, B., Matsushita, Y.: Edge-preserving near-light photometric stereo with neural surfaces. arXiv (2022)
  • [9] Hardy, C., Quéau, Y., Tschumperlé, D.: Uni MS-PS: a Multi-Scale Encoder Decoder Transformer for Universal Photometric Stereo (Feb 2024), https://hal.science/hal-04431103, working paper or preprint
  • [10] Hui, Z., Sankaranarayanan, A.C.: Shape and spatially-varying reflectance estimation from virtual exemplars. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI) 39(10), 2060–2073 (2017)
  • [11] Ikehata, S.: Cnn-ps: Cnn-based photometric stereo for general non-convex surfaces. In: ECCV (2018)
  • [12] Ikehata, S.: Universal photometric stereo network using global lighting contexts. CVPR (2022)
  • [13] Ikehata, S.: Scalable, detailed and mask-free universal photometric stereo. CVPR (2023)
  • [14] Ju, Y., Zhang, C., Huang, S., Rao, Y., Lam, K.M.: Learning deep photometric stereo network with reflectance priors. In: ICME (2023)
  • [15] Kaya, B., Kumar, S., Oliveira, C., Ferrari, V., Van Gool, L.: Uncertainty-aware deep multi-view photometric stereo. In: CVPR (2022)
  • [16] Kaya, B., Kumar, S., Oliveira, C., Ferrari, V., Van Gool, L.: Multi-view photometric stereo revisited. In: WACV (2023)
  • [17] Kaya, B., Kumar, S., Sarno, F., Ferrari, V., Van Gool, L.: Neural radiance fields approach to deep multi-view photometric stereo. In: WACV (2021). https://doi.org/10.48550/ARXIV.2110.05594, https://arxiv.org/abs/2110.05594
  • [18] Kong, H., Xu, P., Teoh, E.K.: Binocular uncalibrated photometric stereo. In: ISCV (2006)
  • [19] Li, J., Li, H.: Neural reflectance for shape recovery with shadow handling. In: CVPR. pp. 16221–16230 (2022)
  • [20] Li, J., Li, H.: Self-calibrating photometric stereo by neural inverse rendering. In: ECCV. Springer (2022)
  • [21] Li, M., Zhou, Z., Wu, Z., Shi, B., Diao, C., Tan, P.: Multi-view photometric stereo: A robust solution and benchmark dataset for spatially varying isotropic materials. IEEE Trans. Image Process. (2020)
  • [22] Li, Z., Müller, T., Evans, A., Taylor, R.H., Unberath, M., Liu, M.Y., Lin, C.H.: Neuralangelo: High-fidelity neural surface reconstruction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
  • [23] Liu, Y., Wang, P., Lin, C., Long, X., Wang, J., Liu, L., Komura, T., Wang, W.: Nero: Neural geometry and brdf reconstruction of reflective objects from multiview images. In: SIGGRAPH (2023)
  • [24] Logothetis, F., Budvytis, I., Cipolla, R.: A neural height-map approach for the binocular photometric stereo problem. In: WACV (2023)
  • [25] Logothetis, F., Budvytis, I., Cipolla, R.: A neural height-map approach for the binocular photometric stereo problem. WACV (2024)
  • [26] Logothetis, F., Budvytis, I., Mecca, R., Cipolla, R.: PX-NET: Simple, Efficient Pixel-Wise Training of Photometric Stereo Networks. In: ICCV (2021)
  • [27] Logothetis, F., Mecca, R., Budvytis, I., Cipolla, R.: A cnn based approach for the point-light photometric stereo problem. IJCV (2022)
  • [28] Logothetis, F., Mecca, R., Cipolla, R.: A differential volumetric approach to multi-view photometric stereo. In: ICCV (2019)
  • [29] Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface construction algorithm. In: Seminal graphics: pioneering efforts that shaped the field, pp. 347–353 (1998)
  • [30] Matusik, W., Pfister, H., Brand, M., McMillan, L.: A data-driven reflectance model. ACM TOG (2003)
  • [31] Mecca, R., Wetzler, A., Bruckstein, A., Kimmel, R.: Near Field Photometric Stereo with Point Light Sources. SIAM Journal on Imaging Sciences (2014)
  • [32] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
  • [33] Nießner, M., Zollhöfer, M., Izadi, S., Stamminger, M.: Real-time 3d reconstruction at scale using voxel hashing. ACM Trans. Graph. (2013)
  • [34] Park, J., Sinha, S.N., Matsushita, Y., Tai, Y.W., Kweon, I.S.: Robust multiview photometric stereo using planar mesh parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(8), 1591–1604 (2017)
  • [35] Quéau, Y., Durou, J.D.: Edge-preserving integration of a normal field: Weighted least squares, TV and L1 approaches. In: SSVM (2015)
  • [36] Sitzmann, V., Martel, J.N., Bergman, A.W., Lindell, D.B., Wetzstein, G.: Implicit neural representations with periodic activation functions. In: NeurIPS (2020)
  • [37] Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron, J.T., Srinivasan, P.P.: Ref-nerf: Structured view-dependent appearance for neural radiance fields. CVPR (2022). https://doi.org/10.48550/ARXIV.2112.03907, https://arxiv.org/abs/2112.03907
  • [38] Wang, C., Wang, L., Matsushita, Y., Huang, B., Chen, M., Soong, F.K.: Binocular photometric stereo acquisition and reconstruction for 3d talking head applications. In: INTERSPEECH (2013)
  • [39] Woodham, R.J.: Determining surface curvature with photometric stereo. In: ICRA (1989)
  • [40] Yang, W., Chen, G., Chen, C., Chen, Z., Wong, K.Y.K.: Ps-nerf: Neural inverse rendering for multi-view photometric stereo. In: ECCV (2022)
  • [41] Yang, W., Chen, G., Chen, C., Chen, Z., Wong, K.Y.K.: S3-NeRF: Neural reflectance field from shading and shadow under a single viewpoint. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) NeurIPS (2022), https://openreview.net/forum?id=tvwkeAIcRP8
  • [42] Yariv, L., Gu, J., Kasten, Y., Lipman, Y.: Volume rendering of neural implicit surfaces. In: Thirty-Fifth Conference on Neural Information Processing Systems (2021)
  • [43] Yariv, L., Kasten, Y., Moran, D., Galun, M., Atzmon, M., Ronen, B., Lipman, Y.: Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems 33, 2492–2502 (2020)
  • [44] Zhang, J., Yao, Y., Li, S., Liu, J., Fang, T., McKinnon, D., Tsin, Y., Quan, L.: Neilf++: Inter-reflectable light fields for geometry and material estimation. In: International Conference on Computer Vision (ICCV) (2023)
  • [45] Zhao, D., Lichy, D., Perrin, P.N., Frahm, J.M., Sengupta, S.: Mvpsnet: Fast generalizable multi-view photometric stereo. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12525–12536 (October 2023)
  • [46] Zhou, Z., Wu, Z., Tan, P.: Multi-view photometric stereo with spatially varying isotropic materials. In: IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • [47] Zollhöfer, M., Dai, A., Innmann, M., Wu, C., Stamminger, M., Theobalt, C., Nießner, M.: Shading-based refinement on volumetric signed distance functions. ACM Trans. Graph. (2015)

Appendix 0.A Appendix

This document contains supplementary material. It is organised as follows: DiLiGenT-MV data anomalies in Section 0.A.1 and visualisation of shadows and indirect reflectance on synthetic data in Section 0.A.2.

0.A.1 DiLiGenT-MV data anomalies

This section gives additional information about identified data anomalies in DiLiGenT-MV data. First of all, we report that the rotation matrices for the Reading object do not have determinant 1 (as valid rotation matrices should). For example, for view 1 this is shown in Equation 3:

\[
R_{1}=\begin{bmatrix}0.0238&1.0031&-0.0137\\ 0.4530&-0.0230&-0.8912\\ -0.8912&0.0150&-0.4533\end{bmatrix}\quad\text{with}\quad\det(R_{1})=1.0035 \tag{3}
\]

In addition, we note that the intrinsic matrix of the Reading object differs from that of the rest of the objects, with the difference being in the x-axis focal length as well as the principal point, as shown in Equations 4 and 5:

\begin{align}
K_{Reading} &= \begin{bmatrix}3759.1&0&305.5\\ 0&3759&255.5\\ 0&0&1\end{bmatrix} \tag{4}\\
K_{rest} &= \begin{bmatrix}3772.1&0&305.875\\ 0&3759&255.875\\ 0&0&1\end{bmatrix} \tag{5}
\end{align}

As we show in Table 2 of the main submission, the best performance and compatibility with the supplied GT normal maps was achieved by using the Reading intrinsics for all objects, as well as fixing the scaling in the rotation matrices with SVD.
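To give a sense of the magnitude of the intrinsics mismatch, the following minimal sketch projects the same camera-frame 3D point with both intrinsic matrices and reports the resulting pixel shift. The two test points are hypothetical: the 1500mm depth is taken from the approximate capture distance quoted in the abstract, and the 20mm lateral offset is an arbitrary illustrative value.

```python
import numpy as np

# Intrinsic matrices as provided for Reading and for the remaining objects.
K_reading = np.array([[3759.1, 0.0, 305.5],
                      [0.0, 3759.0, 255.5],
                      [0.0, 0.0, 1.0]])
K_rest = np.array([[3772.1, 0.0, 305.875],
                   [0.0, 3759.0, 255.875],
                   [0.0, 0.0, 1.0]])


def project(K, X):
    """Pinhole projection of a camera-frame 3D point X (in mm) to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]


# Hypothetical points: one on the optical axis, one 20mm off-axis, both at 1500mm depth.
X_axis = np.array([0.0, 0.0, 1500.0])
X_off = np.array([20.0, 0.0, 1500.0])

shift_axis = project(K_rest, X_axis) - project(K_reading, X_axis)
shift_off = project(K_rest, X_off) - project(K_reading, X_off)
print("pixel shift on the optical axis:", shift_axis)
print("pixel shift 20mm off-axis:     ", shift_off)
```

Under these assumptions the shift is a fraction of a pixel, which is nevertheless consistent with the provided normal maps and meshes failing to align exactly when the original intrinsics are used.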

In addition, we note that various images of the Bear object appear to be corrupted, as shown in Figure 6. This has been a known issue for the first view (first noted by Ikehata in [11]), but we also found corrupted images in other views. As most of the images are very dark, this is not easy to notice unless the brightness is adjusted.

Finally, we note that the background seems to be occluding part of the bottom of some objects, as shown in Figure 7. This justifies our choice of removing the bottom 6mm of all objects for all methods when evaluating reconstruction accuracy.

Figure 6: Example of corrupted images on DiLiGenT-MV data. From left to right view 1 lights 1 and 10, view 15 lights 48 and 64 (for the Bear object). Top row contains the original images, bottom row contains brightened up grayscale versions that make visualisation easier. It is clear that there is some data corruption around the specular highlights.
Figure 7: Example of background occluding the bottom part of Cow (left) and Reading (right) objects. We show brightened average RGB image as well as full image normal maps (computed with Uni MS-PS [9]) to better visualise this issue.

0.A.2 DiLiGenT-MV-Synthetic visualisation

This section contains additional visualisations of the relationship between shadows and indirect reflectance, obtained by investigating the synthetic version of DiLiGenT-MV. Figure 8 shows that in regions with a high number of shadows (i.e. ambient occlusion $\tilde{s}$), the indirect reflectance is higher.

Figure 8: Visualisation of the relationship between shadows and indirect illumination (self reflection) on the synthetic version of DiLiGenT-MV. The first column shows the ambient occlusion estimate $\tilde{s}$, and the second and third columns show the mean and standard deviation (over all lights) of the attenuation-compensated indirect reflectance $i_{r}/a$.