
¹ Toshiba Europe
[email protected]
² University of Cambridge
[email protected]

NPLMV-PS: Neural Point-Light Multi-View Photometric Stereo

Fotios Logothetis¹, Ignas Budvytis¹, Roberto Cipolla¹,²

Abstract

In this work we present a novel multi-view photometric stereo (PS) method. Like many works in 3D reconstruction, we leverage neural shape representations and learnt renderers. However, our work differs from state-of-the-art multi-view PS methods such as PS-NeRF [40] or SuperNormal [1] in that we explicitly leverage per-pixel intensity renderings rather than relying mainly on estimated normals.

We model point light attenuation and explicitly raytrace cast shadows in order to best approximate the incoming radiance at each point. This is used as input to a fully neural material renderer that uses minimal prior assumptions and is jointly optimised with the surface. Finally, estimated normal and segmentation maps can also be incorporated in order to maximise the surface accuracy.

Our method is among the first to outperform the classical approach of DiLiGenT-MV and achieves an average Chamfer distance of 0.2mm for objects imaged from a distance of approximately 1.5m at a resolution of approximately $400\times 400$ . Moreover, we show robustness to poor normals in the low light count scenario, achieving 0.27mm Chamfer distance when pixel rendering is used instead of estimated normals.

Keywords: Multiview Photometric Stereo · Neural Rendering · Neural Point Light Rendering

1 Introduction

Photometric Stereo (PS) is a long-standing and important problem in Computer Vision. While early PS works [39, 11, 27, 13] primarily tackled the estimation of normals from single-view images, their potential was really unlocked by binocular [18, 4, 38, 24] and multi-view [5, 46, 34, 28, 21, 40] setups, which allow a more accurate recovery of shape and not only normals. This opens up multiple applications such as general 3D reconstruction of objects for novel-view rendering, relighting or material editing [40], robot interaction, quality control in manufacturing or industrial conveyor belt scanning.

Figure 1: In this figure we demonstrate the fragility of relying mainly on estimated normals (using [9]) for deep learning based multi-view photometric stereo. The first column shows the ground truth of the Buddha object from the DiLiGenT-MV benchmark. The second column shows the estimated normals using only 6 out of the 96 lights available (the mean angular normal error for view 1 is $13.5^{\circ}$ ). Using such normals leads to a large error in reconstruction (0.49mm). If intensities are used instead, an error of 0.29mm is achieved. The right column shows the result using all lights; we note that we outperform DiLiGenT-MV [21], PS-NeRF [40] and MVAS [2] and only lose to SuperNormal [1], as shown in Table 2.

Along with the increasing number of views, Photometric Stereo has undergone another important change by moving from inverse graphics approaches enabled by classical non-linear optimisation (for single view [10], binocular [18], multi-view [28, 21]) to neural networks (e.g. [11, 26]) and in particular to inverse graphics approaches enabled by neural representations (for single view [8], binocular [24] and multi-view [40, 1]). Despite the latter methods (especially [1]) achieving impressive accuracy on the DiLiGenT-MV [21] benchmark, their approaches are somewhat unsatisfactory. In particular, [1] does not explicitly use image intensities to optimise for shape and is fully reliant on per-view normal maps, whereas PS-NeRF [40] only uses the average intensity during the surface optimisation stage and thus leaves most of the photometric information unused.

In this work we provide the first neural multi-view photometric stereo approach which fully leverages the availability of pixel intensity information for estimating 3D shape from Photometric Stereo images (see Figure 5). We achieve this by explicitly modeling the incident light from point light sources, which enables intensity-based shape optimisation in place of purely normal-driven shape optimisation [40, 1]. The latter is fragile to incorrectly estimated normals, especially in cases of very few available lights.

We model point light attenuation, explicitly raytrace cast shadows and also optimise a fully neural material renderer. Thus we are able to achieve competitive reconstruction accuracy using only 6 lights and match the SOTA performance when all lights and normal map information are also fused.

The remainder of this paper is organised as follows. Section 2 discusses the related works in Photometric Stereo and Multi-View Stereo. It is followed by a description of our method in Section 3. The experimental setup and experiment results are described in Sections 4 and 5 respectively.

2 Related Work

There is an extensive literature on single and multi-view photometric stereo and we review the following cases:

Single view photometric stereo. The first successful deep learning based single view PS method was CNN-PS [11], which was extended by [26] and [27] to be applicable to general calibrated point-light configurations. Other works like [7, 5, 14] have used material reflectance priors (using specific BRDFs like Lambertian or Ward) for single view normal prediction. Other recent approaches have tackled the uncalibrated setting, like [3, 20, 19, 41], and more recently the idea of universal uncalibrated PS [12, 13] and [9].

Multi-view photometric stereo. Classical optimisation approaches have used triangle meshes [34] or signed distance function based parameterisations [28, 33, 47] to tackle the multi-view PS problem under diffuse reflectance. More advanced methods such as [21] are applicable to more general materials.

Neural surfaces. Recently, neural surface approaches have become very popular following the introduction of NeRF [32] and its extension to neural SDF parameterisations in [43, 42]. The first approaches to use neural SDFs specifically for the multi-view PS problem are [17, 15, 16, 45], PS-NeRF [40] and SuperNormal [1]. However, these approaches mainly rely on normal fusion and do not attempt to update the neural surface through rendering. Other recent approaches have advanced the sophistication of the rendering methods to be more structured and thus respect the physics of light reflection more, like Ref-NeRF [37], Neuralangelo [22], NERO [23] and NeILF++ [44], but none of these methods has yet been applied to the PS problem, and in particular they lack the ability to model point light illumination. Finally, it is worth mentioning [8], who introduced the idea of an infinitely differentiable surface (SIREN [36]) with Lambertian rendering for single view PS, and [25], which extended this method to the binocular setting and also added a fully learnable general material renderer.

We borrow the material renderer from [25] while extending the approach to work in the multi-view setting using an SDF parameterisation similar to [43, 42].

3 Method

This section describes our method for solving the point light multi-view photometric stereo problem. A high level overview is shown in Figure 2. Our method is primarily an inverse neural rendering method. Section 3.1 describes the assumed irradiance model, Section 3.2 describes the underlying neural surface parameterisation, Section 3.3 describes the initialisation and Section 3.4 describes the training losses used.

3.1 Irradiance equation

We now explain the assumed irradiance of a world point $\boldsymbol{\mathbf{X}}$ with surface normal $\boldsymbol{\mathbf{N}}$ and albedo $\rho$ . We assume point light sources indexed by $m$ at positions $\mathbf{P}_{m}$ , which generate variable lighting vectors $\mathbf{L}_{m}=\mathbf{P}_{m}-\mathbf{X}$ . In addition, point light propagation results in the attenuation factor $a_{m}=\frac{\phi_{m}}{||\mathbf{L}_{m}||^{2}}$ , where $\phi_{m}$ is the intrinsic brightness of the light source. We note that the literature [31] usually also assumes an angular dissipation factor, but these calibration numbers are unavailable for DiLiGenT-MV [21], so we opt for the simpler, perfect point light source model. Thus, the reflected intensity $i_{m}$ of that point for the $m$ th light source (omitting the dependence on $\boldsymbol{\mathbf{X}}$ for clarity) is:

$$i_{m}=s_{m}a_{m}\rho B(\boldsymbol{\mathbf{N}},\boldsymbol{\mathbf{L}}_{m},\boldsymbol{\mathbf{V}})+i_{r,m} \tag{1}$$

$B$ is assumed to be a general BRDF and $s_{m}\in\{0,1\}$ is an indicator variable accounting for cast shadows that completely block direct reflectance. $i_{r,m}$ corresponds to indirect reflectance (e.g. self reflections), and we make the further simplifying assumption that it is proportional to the surface albedo as well as the light source attenuation. We also note that self reflections occur on concave parts of an object's surface, which also correlate with more cast shadows, and therefore we assume $i_{r}\propto(a\rho\sum(1-s))$ . Indeed, the term ambient occlusion is usually used to describe the total obliqueness of a surface point, and this is approximated with the number of shadows. To simplify the learning problem, we assume that the remaining learned factor $R(\boldsymbol{\mathbf{X}})$ is constant between different lights, thus $i_{r,m}=a\rho\sum(1-s)R(\boldsymbol{\mathbf{X}})$ . This assumption is further justified in the supplementary by empirical measurements on Blender-rendered data.
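As a rough illustration of Equation 1, the following Python sketch evaluates the intensity of one point under one point light. The function names, the clamped-cosine `lambertian` stand-in for the BRDF $B$ and the numerical values are assumptions made for this example only; they are not the implementation used in the experiments.

```python
import math

def point_light_intensity(X, N, V, P, phi, rho, s, brdf, i_r=0.0):
    """Evaluate Equation 1 for one surface point X and one point light at P."""
    # Lighting vector L = P - X and its squared length ||L||^2.
    L = [p - x for p, x in zip(P, X)]
    dist2 = sum(c * c for c in L)
    # Inverse-square attenuation a = phi / ||L||^2.
    a = phi / dist2
    # The BRDF is evaluated with the unit light direction.
    L_hat = [c / math.sqrt(dist2) for c in L]
    return s * a * rho * brdf(N, L_hat, V) + i_r

def lambertian(N, L_hat, V):
    # Hypothetical stand-in for B: clamped cosine, independent of the view V.
    return max(0.0, sum(n * l for n, l in zip(N, L_hat)))

# A light 2 units above an upward-facing point: a = 4 / 2^2 = 1.
lit = point_light_intensity((0, 0, 0), (0, 0, 1), (0, 0, 1), (0, 0, 2),
                            phi=4.0, rho=0.5, s=1, brdf=lambertian)
shadowed = point_light_intensity((0, 0, 0), (0, 0, 1), (0, 0, 1), (0, 0, 2),
                                 phi=4.0, rho=0.5, s=0, brdf=lambertian)
```

With $s=0$ the direct term vanishes and only the indirect term $i_{r,m}$ (zero in this example) remains.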

3.2 Neural SDF

Geometry parameterisation. Following other neural volumetric approaches [43, 42], we parameterise the scene geometry as the zero level set of an implicit function $F$ corresponding to the Signed Distance Field, $d=F(\boldsymbol{\mathbf{X}})$ . The unknown function $F$ is a deep neural network and the objective is to optimise its weights. We note that the SDF of any arbitrary geometry is always continuous, differentiable almost everywhere and satisfies the Eikonal equation of unit magnitude gradient, $||\nabla F(\boldsymbol{\mathbf{X}})||=1$ . Finally, we note that for surface points (where $F(\boldsymbol{\mathbf{X}})=0$ ), the surface normal is the gradient of the SDF, i.e. $\boldsymbol{\mathbf{N}}(\boldsymbol{\mathbf{X}})=\nabla F$ . This gives a direct relationship between the SDF and the rendering and thus allows the SDF to be trained through a rendering loss on the initial photometric stereo images. In addition, if surface normal maps are available (e.g. from single view estimation networks), they can also be used as an additional training signal. We use the SIREN architecture [36], which is an MLP with sinusoidal activation functions; this guarantees that the surface is infinitely differentiable and thus can be easily recovered from its derivatives. Indeed, [8, 25, 44] have used SIREN networks for single and binocular view PS as well as multi-view stereo, so our method is a direct extension of these works specifically for MV-PS under point light sources.

Ray sampling. We follow a volumetric sampling and rendering method similar to VolSDF [42], where the neural SDF is queried at multiple samples on outgoing rays from each image foreground pixel. For each pixel an estimate of the depth is available (initialised from single view PS and occasionally updated during training) and thus most of the samples are concentrated around that depth. However, to allow the surface to evolve and to minimise free space artefacts, additional samples are also taken in a larger depth range. At each point, the Laplace density function (see VolSDF [42]) is used to convert SDF values to transparency, and alpha blending is used to accumulate depth, normals and rendered intensity along each ray using the standard approach.

Thus, the transparency is $\alpha=\text{exp}(-t\delta r)$ , with $\delta r$ being the distance along the ray, and the intersection $\boldsymbol{\mathbf{X}}_{A}$ is computed as a weighted sum of the ray samples, $\boldsymbol{\mathbf{X}}_{A}=\sum_{i}(w_{i}\boldsymbol{\mathbf{r}}_{i})$ , with the sample weights $w_{i}$ corresponding to the accumulated opacity. We note that this average point $\boldsymbol{\mathbf{X}}_{A}$ is further used to compute cast shadows but it is not directly rendered. Similar averaging is used for the rendered intensity as well as the surface normal, which are used to compute losses. The ray sampling process is shown in Figure 3.
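The two SDF properties used above (unit gradient magnitude and normal equal to the gradient on the surface) can be checked numerically. The sketch below assumes an analytic unit-sphere SDF in place of the neural network $F$ and central finite differences in place of automatic differentiation; both are illustrative choices.

```python
import math

def sphere_sdf(X, radius=1.0):
    # Analytic SDF of a sphere centred at the origin (stand-in for the network F).
    return math.sqrt(sum(c * c for c in X)) - radius

def sdf_gradient(F, X, eps=1e-5):
    # Central finite differences along each axis.
    grad = []
    for k in range(3):
        hi, lo = list(X), list(X)
        hi[k] += eps
        lo[k] -= eps
        grad.append((F(hi) - F(lo)) / (2.0 * eps))
    return grad

X = (0.0, 0.6, 0.8)                      # a point on the unit sphere, F(X) = 0
normal = sdf_gradient(sphere_sdf, X)     # surface normal N(X) = grad F
grad_norm = math.sqrt(sum(c * c for c in normal))  # Eikonal: should be close to 1
```

For the sphere the recovered normal coincides with the unit vector from the centre to the point, as expected.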
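The accumulation along a ray can be sketched as follows. The density follows our reading of the VolSDF formulation (Laplace CDF of the negated SDF, scaled by $1/\beta$); the value of `beta`, the uniform sample spacing and the planar test surface are assumptions of this example, and per-sample opacity is taken as one minus the transparency defined above.

```python
import math

def laplace_density(d, beta=0.01):
    """Density as (1 / beta) times the Laplace CDF evaluated at -d."""
    s = -d
    cdf = 0.5 * math.exp(s / beta) if s <= 0 else 1.0 - 0.5 * math.exp(-s / beta)
    return cdf / beta

def composite_weights(sdf_values, deltas, beta=0.01):
    """Alpha-blending weights w_i = T_i * opacity_i along one ray."""
    weights, transmittance = [], 1.0
    for d, dr in zip(sdf_values, deltas):
        opacity = 1.0 - math.exp(-laplace_density(d, beta) * dr)
        weights.append(transmittance * opacity)
        transmittance *= 1.0 - opacity
    return weights

# Ray along +z hitting the plane z = 1, whose SDF along the ray is d = 1 - z.
depths = [0.9 + 0.01 * k for k in range(21)]
sdf_values = [1.0 - z for z in depths]
weights = composite_weights(sdf_values, [0.01] * len(depths))
# Weighted average depth, i.e. the depth of the average intersection X_A.
depth_A = sum(w * z for w, z in zip(weights, depths)) / sum(weights)
```

The weights peak around the zero crossing of the SDF, so the averaged depth lands close to the true surface at $z=1$.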

Learned BRDF. We follow the approach of [25], where the BRDF is parameterised as another SIREN network and thus completely learned from the data. This assumes that the material properties are uniform across the target scene except for a scalar albedo variation. We emphasise that we chose to perform grayscale intensity rendering instead of full RGB as this is expected to minimise the synthetic to real gap. Indeed, real RGB images are usually acquired with demosaicing of single intensity values and this procedure is usually optimised to best recover intensity, not colour (e.g. see [6], used by OpenCV).

To minimise over-fitting, the material BRDF network receives as input only the relative angles between $\boldsymbol{\mathbf{N}}$ , $\boldsymbol{\mathbf{L}}$ and $\boldsymbol{\mathbf{V}}$ . In addition, to simplify the learning problem, we follow the principles described in the MERL real material database [30]. For that, the half vector $\boldsymbol{\mathbf{H}}=\frac{\boldsymbol{\mathbf{L}}+\boldsymbol{\mathbf{V}}}{|\boldsymbol{\mathbf{L}}+\boldsymbol{\mathbf{V}}|}$ is first computed and the input to the network is the relative angles between $\boldsymbol{\mathbf{N}}$ , $\boldsymbol{\mathbf{L}}$ and $\boldsymbol{\mathbf{H}}$ . Finally, we note that the final activation of the SIREN part of the BRDF network is exponential and there is a post-multiplication with an $\boldsymbol{\mathbf{N}}\cdot\boldsymbol{\mathbf{L}}$ factor, such that the network learns a multiplicative factor over the diffuse reflectance. We thus parameterise:

$$B(\boldsymbol{\mathbf{N}},\boldsymbol{\mathbf{L}}_{m},\boldsymbol{\mathbf{V}})=(\boldsymbol{\mathbf{N}}\cdot\boldsymbol{\mathbf{L}}_{m})\exp\big(\text{SIREN}(\boldsymbol{\mathbf{N}}\cdot\boldsymbol{\mathbf{H}}_{m},\boldsymbol{\mathbf{N}}\cdot\boldsymbol{\mathbf{L}}_{m},\boldsymbol{\mathbf{H}}_{m}\cdot\boldsymbol{\mathbf{L}}_{m})\big) \tag{2}$$
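A minimal sketch of how the inputs of Equation 2 can be assembled is given below. The callable `log_factor` is a hypothetical placeholder for the SIREN network, and all function names are assumptions of this example.

```python
import math

def normalise(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def brdf(N, L, V, log_factor):
    """Equation 2, with `log_factor` standing in for the SIREN network."""
    N, L, V = normalise(N), normalise(L), normalise(V)
    # Half vector H = (L + V) / |L + V|.
    H = normalise([l + v for l, v in zip(L, V)])
    n_h, n_l, h_l = dot(N, H), dot(N, L), dot(H, L)
    # Diffuse cosine term times a learned multiplicative factor.
    return n_l * math.exp(log_factor(n_h, n_l, h_l))

# With a network output of zero the factor is exp(0) = 1, i.e. pure N.L shading.
zero_net = lambda n_h, n_l, h_l: 0.0
frontal = brdf((0, 0, 1), (0, 0, 1), (0, 0, 1), zero_net)
oblique = brdf((0, 0, 1), (math.sqrt(3) / 2, 0, 0.5), (0, 0, 1), zero_net)
```

Because the network only sees the three dot products, the learned reflectance is independent of the absolute orientation of the surface.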

Albedo. The scalar albedo $\rho$ is learned with another SIREN network which is queried for every sample point.

Shadow estimation. Handling cast shadows is imperative for any successful photometric stereo method. To estimate cast shadows for a point, we raytrace from that point to the light source following the direction of the lighting vectors $\mathbf{L}_{m}$ computed above. For each ray we take 16 random samples $j\in[2\text{mm},50\text{mm}]$ . For all these points, we query the SDF network and accumulate opacity following the same volumetric rendering procedure, i.e. $s=\prod_{j}(1-\alpha_{j})$ . We note that this shadow computation procedure is very computationally expensive and therefore cannot be performed for all the rendered points. Instead, we only compute it for the average intersection point of each ray ( $\boldsymbol{\mathbf{X}}_{A}$ in Figure 3) and assume it to be constant for all the other ray samples.

Indirect reflectance. Finally, the indirect reflectance $i_{r,m}$ (Equation 1) is approximated as $i_{r,m}=a\rho\tilde{s}R(\boldsymbol{\mathbf{X}})$ . $\tilde{s}$ is computed as the mean value of one minus the shadows (over the number of lights) and it is a measure of the local surface concavity: $\tilde{s}=0$ corresponds to no shadows and hence completely convex local geometry and therefore no interreflection, while $\tilde{s}=1$ corresponds to all lights being shadowed and hence maximum concavity, so we allow for maximum learned interreflection. Finally, $R(\boldsymbol{\mathbf{X}})$ is queried from another SIREN network and we note that it is parameterised to be constant over all lights.

Some visualisations of the generated renderings for synthetic and real experiments are shown in Figure 4.
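The shadow ray procedure can be sketched as below, using an analytic sphere as the occluder. Several details are assumptions of this example rather than statements about the implementation: $\alpha_{j}$ is treated as per-sample opacity, the density is the VolSDF-style Laplace form with an arbitrary `beta`, and the 16 samples are drawn one per equal-width bin (stratified) so that they remain ordered along the ray.

```python
import math
import random

def sphere_sdf(X, centre, radius):
    return math.dist(X, centre) - radius

def laplace_density(d, beta):
    # (1 / beta) times the Laplace CDF evaluated at -d (VolSDF-style assumption).
    cdf = 0.5 * math.exp(-d / beta) if d >= 0 else 1.0 - 0.5 * math.exp(d / beta)
    return cdf / beta

def shadow_visibility(X, P, sdf, n_samples=16, t_min=2.0, t_max=50.0,
                      beta=0.5, seed=0):
    """Visibility s of the light at P from X (1 = lit, 0 = fully shadowed)."""
    rng = random.Random(seed)
    length = math.dist(P, X)
    direction = [(p - x) / length for p, x in zip(P, X)]
    bin_width = (t_max - t_min) / n_samples
    visibility, t_prev = 1.0, t_min
    for j in range(n_samples):
        # One random sample per bin, at a distance in [t_min, t_max] mm.
        t = t_min + (j + rng.random()) * bin_width
        point = [x + t * c for x, c in zip(X, direction)]
        alpha = 1.0 - math.exp(-laplace_density(sdf(point), beta) * (t - t_prev))
        visibility *= 1.0 - alpha      # s = prod_j (1 - alpha_j)
        t_prev = t
    return visibility

# A sphere of radius 10mm, 25mm above the query point, blocks a light overhead
# but not a light placed to the side.
occluder = lambda X: sphere_sdf(X, centre=(0.0, 0.0, 25.0), radius=10.0)
blocked = shadow_visibility((0.0, 0.0, 0.0), (0.0, 0.0, 1000.0), occluder)
clear = shadow_visibility((0.0, 0.0, 0.0), (1000.0, 0.0, 0.0), occluder)
```

Starting the samples at 2mm keeps the ray from immediately re-intersecting the surface it originates from.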
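The concavity measure $\tilde{s}$ and the resulting indirect term can be written in a few lines. The sketch below assumes that the per-light shadow values $s_{m}$ are already available and that `R` is the scalar returned by the indirect reflectance network; the names are illustrative.

```python
def concavity(shadow_values):
    """Mean of (1 - s_m) over the lights: 0 = never shadowed, 1 = always shadowed."""
    return sum(1.0 - s for s in shadow_values) / len(shadow_values)

def indirect_reflectance(a, rho, shadow_values, R):
    """Indirect term i_r = a * rho * s_tilde * R(X)."""
    return a * rho * concavity(shadow_values) * R

# A point that is lit by two of four lights has s_tilde = 0.5.
s_tilde = concavity([1.0, 1.0, 0.0, 0.0])
i_r = indirect_reflectance(a=2.0, rho=0.5, shadow_values=[1.0, 1.0, 0.0, 0.0], R=0.4)
```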

3.3 Initialisation

As is standard in competing approaches, e.g. [40], we can use single view PS (at each view) in order to get normal and depth estimates. We start by computing per view normal maps using the state-of-the-art PS normal estimation network [9] and also use numerical integration [35] in order to get approximate depth maps. The depth maps do not need to be accurate as they are only used to initialise the ray search spaces. The normal maps can be used in order to provide an additional training signal.

The SDF network is initialised with weights that approximate the SDF of a perfect sphere. To speed up convergence, we always run 3 epochs with the normal and silhouette losses without rendering (which is a lot more computationally expensive).

Final surface calculation. After the optimisation is completed, the SDF network can be sampled on a regular grid of points and a triangle mesh surface can be recovered using the standard marching cubes [29] algorithm. For these recovered vertices, the albedo network can be queried in order to obtain a textured reconstruction.

3.4 Losses

We use the following losses:

Rendering loss. We use an L1 loss on the rendered intensities, i.e. $L_{rend}=|i_{rend}-i_{gt}|$ . We note that to better balance the rendering data, each light source is scaled so that the maximum GT intensity is 1 (saturated). This is performed because some of the DiLiGenT-MV lights are very dark. The relative weighting of this loss when used is 1e3.

Silhouette loss. As is standard in volumetric SDF approaches, we apply a binary cross entropy loss between the predicted silhouettes and the ground truth ones. We generate predicted silhouettes with the total accumulated opacity $\alpha_{total}=\prod_{j}(1-\alpha_{j})$ (which should be 1 for foreground and 0 for background points), $L_{sil}=\text{BCE}(\alpha_{total},mask)$ . We consider a 5 pixel band outside the provided segmentation masks for background points. We note that we ignore the outermost pixels of the provided segmentation masks and also the innermost pixels of the background (leaving 2 pixel boundaries between foreground and background). This is done for better numerical stability as the exact discretisation of the boundaries is not very reliable. The relative weighting of this loss is 1e1.

Normal loss. We apply an angular normal loss to match the SDF computed normals $\boldsymbol{\mathbf{N}}_{s}$ to the network predicted normals $\boldsymbol{\mathbf{N}}_{n}$ . We follow [25, 26] and use the angular error (as opposed to just L2 used in approaches like [40]), $L_{n}=|\text{atan2}(||\boldsymbol{\mathbf{N}}_{n}\times\boldsymbol{\mathbf{N}}_{s}||,\boldsymbol{\mathbf{N}}_{n}\cdot\boldsymbol{\mathbf{N}}_{s})|$ . For rays where the accumulated opacity is less than 0.01 (i.e. the ray does not intersect any surface), no normal loss is applied. In addition, following previous works, the normal loss is weighted by the obliqueness of each point $(\boldsymbol{\mathbf{N}}\cdot\boldsymbol{\mathbf{V}})$ , which stops the optimisation from trying to fit occlusion boundaries, which are numerically unstable. The relative weighting of this loss when used is 1.

Eikonal loss. To enforce the Eikonal equation for all ray samples, we apply an L1 loss, which is the standard in most SDF approaches, i.e. $L_{eik}=\big|\,||\nabla d||-1\,\big|$ . The relative weighting of this loss is 1e1.

Free space regulariser. It is important to use some sort of regulariser to minimise free space artifacts on the inside and outside of the object, in unseen regions. To do that, at each ray, the SDF needs to be encouraged to become more negative at the points behind the first ray/surface intersection, with the penalty being the difference between the target sign $s_{t}$ (-1 for inside, +1 for outside) and the soft sign, i.e. $\text{softsign}(d)=\frac{d}{1+|d|}$ . So we compute $L_{reg}=|\text{softsign}(d)-s_{t}|$ . This is shown graphically in Figure 3, where $\boldsymbol{\mathbf{X}}_{r1}$ is in front of $\boldsymbol{\mathbf{X}}_{A}$ and thus has $s_{t}=1$ , while $\boldsymbol{\mathbf{X}}_{r6}$ is behind and thus has $s_{t}=-1$ . Finally, to minimise the effect of the regulariser on the surface evolution, we only apply it to points whose opacity contribution is less than 1% (i.e. far away from $\boldsymbol{\mathbf{X}}_{A}$ , colored red in Figure 3). The relative weighting of this loss is 1e-1.
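The per-light scaling and the L1 loss can be sketched as follows. Images are represented as flat lists of intensities, one list per light, and the loss is averaged over all pixels; both the data layout and the mean reduction are assumptions of this example.

```python
def scale_per_light(images):
    """Scale each light's intensities so that its maximum GT value becomes 1."""
    return [[v / max(img) for v in img] for img in images]

def rendering_loss(rendered, target):
    """Mean L1 difference between rendered and ground-truth intensities."""
    diffs = [abs(r - t)
             for r_img, t_img in zip(rendered, target)
             for r, t in zip(r_img, t_img)]
    return sum(diffs) / len(diffs)

# The second light is 20x darker, yet both contribute equally after scaling.
gt = [[0.2, 0.4, 0.8], [0.01, 0.02, 0.04]]
gt_scaled = scale_per_light(gt)
loss = rendering_loss([[0.5, 0.5, 0.5]], [gt_scaled[0]])
```

In a full pipeline the rendered intensities would need to be expressed in the same scaled units as the ground truth before the loss is taken.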
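A per-ray version of the angular loss is sketched below. Using the SDF normal in the obliqueness weight and clamping that weight at zero are choices made for this example; the text above does not specify them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def angular_normal_loss(N_n, N_s, V, opacity=1.0, min_opacity=0.01):
    """Angle between predicted (N_n) and SDF (N_s) normals, weighted by N.V."""
    if opacity < min_opacity:
        return 0.0                      # the ray misses the surface: no loss
    c = cross(N_n, N_s)
    angle = abs(math.atan2(math.sqrt(dot(c, c)), dot(N_n, N_s)))
    weight = max(0.0, dot(N_s, V))      # obliqueness weight (assumed clamped)
    return weight * angle

right_angle = angular_normal_loss((0, 0, 1), (1, 0, 0), V=(1, 0, 0))
aligned = angular_normal_loss((0, 0, 1), (0, 0, 1), V=(0, 0, 1))
background = angular_normal_loss((0, 0, 1), (1, 0, 0), V=(1, 0, 0), opacity=0.001)
```

The atan2 form stays well conditioned for both nearly parallel and nearly opposite normals, unlike an arccos of the dot product.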
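For a single ray sample, the regulariser can be sketched as below. The argument names and the use of scalar depths along the ray to decide whether a sample lies in front of or behind $\boldsymbol{\mathbf{X}}_{A}$ are assumptions of this example.

```python
def softsign(d):
    return d / (1.0 + abs(d))

def free_space_loss(d, depth, surface_depth, opacity_contribution,
                    max_contribution=0.01):
    """|softsign(d) - s_t| for samples that barely contribute to the rendering."""
    if opacity_contribution >= max_contribution:
        return 0.0                      # too close to the surface: not regularised
    # Target sign: +1 in front of the intersection (outside), -1 behind (inside).
    s_t = 1.0 if depth < surface_depth else -1.0
    return abs(softsign(d) - s_t)

# A sample in front of the surface whose SDF is wrongly negative is penalised
# more than a sample behind the surface whose SDF is already negative.
front_wrong = free_space_loss(d=-1.0, depth=0.5, surface_depth=1.0,
                              opacity_contribution=0.0)
behind_ok = free_space_loss(d=-3.0, depth=1.5, surface_depth=1.0,
                            opacity_contribution=0.0)
near_surface = free_space_loss(d=0.001, depth=1.0, surface_depth=1.0,
                               opacity_contribution=0.3)
```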

4 Experiment Setup

This section explains the datasets used for evaluation of our method along with the evaluation metric and competitors.

4.1 Datasets

DiLiGenT-MV. Our main evaluation target is DiLiGenT-MV [21], containing 5 objects with 96 lights in 20 views. Images have $612\times 512$ resolution, with objects actually occupying a maximum of $400\times 400\text{px}$ . GT meshes, camera intrinsics and extrinsics and normal maps are provided, as well as point light positions and far-field light brightness ( $\phi$ ). We note that these brightness values were measured from the intensity of a flat calibration target roughly positioned at the object, so intrinsic brightness is recovered by multiplying with inverse distance square. We note that as these calibration data are unavailable, $\phi$ is expected to be fairly inaccurate and thus is also optimised during training.

Dataset anomalies. In order to have the fairest assessment, we report the following dataset anomalies on DiLiGenT-MV and our attempts to overcome them. Firstly, the provided GT normal maps are incompatible with the GT meshes as well as the provided intrinsics and extrinsics. The best compatibility was achieved by using the intrinsic matrix of Reading for all objects while simultaneously performing SVD to fix the non-unit determinant of the provided rotation matrices (especially for Reading). Despite that, it is still impossible to get a perfect match between the provided normals and meshes, so the exact parameters remain uncertain. For all experiments except the first line of Table 2, we use the modified parameters (Reading intrinsics and enforcing unit determinant on rotation matrices R with SVD).

In addition, the provided segmentation masks for Bear and Cow contained holes that had to be closed manually, otherwise the silhouette loss would create large holes in the reconstructed meshes.

Finally, Ikehata in [11] first noticed that the first 20 images of the Bear appear to be corrupted. We also found more similarly corrupted images in other views (more visualisation in the supplementary) and did our best to manually mark and ignore them, but it is possible that more image corruptions are still unnoticed.

Synthetic DiLiGenT-MV. To better demonstrate the effectiveness of our method without the real data corruptions discussed above, we rendered a synthetic version of DiLiGenT-MV with Blender (see Figure 4). We use the exact same objects with the exact same poses and rendered the 96 point lights. The object materials were chosen to loosely mimic real objects and the albedo was set to a random pattern. Finally, we note that this synthetic data can be used to visualise shadow and indirect reflection maps, which are really hard to correctly evaluate on real data. This is explored in the supplementary.

4.2 Hyperparameters and Training

We use a TensorFlow port of SIREN for all the experiments. The SDF network is set to 5x512 layers (1.05M parameters) and the albedo and self reflection networks to 3x256 (133K parameters). The BRDF network is set to 3x32 layers (2.5K parameters). We use 64 ray samples in a 100mm ray range, an additional 64 around the average intersection (in a shrinking distance range up to 10mm) and 64 extra samples computed with one step of Newton's method (for approximating the zero of the SDF). For each shadow ray we use 16 samples. We train with a batch size of 512 rays for 100 epochs, which takes approx. 20h on an NVIDIA TITAN RTX and 17GB of RAM when rendering all 96 lights. Note that the 6-light version completes in only 5h.
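The Newton step used to place the extra samples can be sketched as follows. The example assumes an analytic unit-sphere SDF and a finite-difference directional derivative along the ray; a network-based implementation would more likely use the automatic gradient of the SDF.

```python
import math

def sphere_sdf(X, radius=1.0):
    return math.sqrt(sum(c * c for c in X)) - radius

def newton_step(sdf, origin, direction, t, eps=1e-4):
    """One Newton update of the ray depth t towards the zero of the SDF."""
    point = lambda u: [o + u * c for o, c in zip(origin, direction)]
    f = sdf(point(t))
    # Directional derivative of the SDF along the ray (central differences).
    df = (sdf(point(t + eps)) - sdf(point(t - eps))) / (2.0 * eps)
    return t - f / df

# A ray from z = -3 towards a unit sphere at the origin first hits it at t = 2.
t0 = 1.5
t1 = newton_step(sphere_sdf, origin=(0.0, 0.0, -3.0), direction=(0.0, 0.0, 1.0), t=t0)
```

For a ray through the sphere centre the SDF is linear along the ray, so a single step already reaches the surface; in general one step only moves the sample closer to the zero crossing.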

4.3 Evaluation protocol

We evaluate our method by computing the Chamfer distance (marked as surface error, SE) between the reconstructions and the ground truth. This is computed as the average of the Hausdorff distance from reconstruction to ground truth and the opposite, with the distances computed with Meshlab. We note that in order to have a fair comparison and not bias the error with the unseen bottom of the objects, the bottom 6mm of the GTs and all reconstructions are removed. All DiLiGenT-MV objects are aligned to be touching the XY plane so the cropping is straightforward. This cropping also avoids large errors at some parts of the bottom of the objects that are occluded by the background (e.g. the feet of Reading, see supplementary).

Competing approaches. We compare against DiLiGenT-MV [21], PS-NeRF [40], MVAS [2] and SuperNormal [1]. For all of the methods, the output reconstructions were downloaded from the SuperNormal project page (https://xucao-42.github.io/homepage/). Synthetic data is only used to ablate our own method as the DiLiGenT-MV and SuperNormal code is not available (and our method using only the normal loss is very similar to PS-NeRF and SuperNormal). We also note that PS-NeRF and MVAS use a modified/pre-processed version of DiLiGenT-MV, and this potentially alleviates some of the anomalies (especially the segmentation mask holes).
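A point-based approximation of this protocol is sketched below. It assumes that both surfaces have been sampled into point lists with the Z axis pointing up, and uses a brute-force nearest neighbour search; it is meant to illustrate the metric and the cropping, not to reproduce the Meshlab computation.

```python
import math

def crop_bottom(points, height=6.0):
    """Discard points in the bottom `height` mm above the XY plane."""
    return [p for p in points if p[2] >= height]

def mean_nearest_distance(A, B):
    """Average distance from each point of A to its nearest point in B."""
    return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)

def chamfer_distance(recon, gt, crop_height=6.0):
    recon, gt = crop_bottom(recon, crop_height), crop_bottom(gt, crop_height)
    # Symmetric average of the two one-sided mean distances.
    return 0.5 * (mean_nearest_distance(recon, gt) + mean_nearest_distance(gt, recon))

# The reconstruction is shifted by 0.2mm; the GT point at z = 3mm is cropped away.
gt = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (2.0, 0.0, 10.0), (50.0, 0.0, 3.0)]
recon = [(0.2, 0.0, 10.0), (1.2, 0.0, 10.0), (2.2, 0.0, 10.0)]
se = chamfer_distance(recon, gt)
```

Without the cropping, the unseen GT point near the base would dominate the error even though the visible surface is reconstructed to within 0.2mm.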

5 Experiments

We describe two sets of experiments, on synthetic data and real data, in Sections 5.1 and 5.2 respectively.

5.1 Synthetic data

As mentioned in Section 4.1, the original DiLiGenT-MV dataset contains several anomalies, making it hard to correctly ablate the several steps of our method. Thus, ablation experiments are performed on the Synthetic DiLiGenT-MV dataset described in Section 4.1.

Method	Bear	Buddha	Cow	Pot2	Reading	Average SE
GT normals	0.06	0.08	0.03	0.04	0.02	0.05
Ours - [normals only]	0.17	0.14	0.10	0.14	0.16	0.14
Ours - [intensities only]	0.13	0.29	0.07	0.20	0.11	0.16
Ours - [normals + intensities]	0.11	0.17	0.04	0.18	0.11	0.12

Table 1: Ablation study on synthetic DiLiGenT-MV data. Using both normals and intensities for shape estimation achieves the best average error. In particular, it outperforms both other configurations on the Bear and Cow objects, indicating that the combined approach is better than a simple interpolation between the two.

We first show that our network can achieve a very low error of 0.05mm when using the ground truth normals, which is not the case for the real DiLiGenT-MV dataset, as shown in Table 2. We also show that if predicted normals are used, the performance is worse but still quite good (i.e. 0.14mm). Using intensities only achieves an overall accuracy of 0.16mm. In addition, using intensities only significantly outperforms the normals only network on Bear, Cow and Reading, but it is inferior on Buddha and Pot2. The reason for this differing performance is the distribution of self reflections, which affect the rendering performance more than the normal estimation network. The combined normals and intensities experiment achieves the best accuracy of 0.12mm average error. In addition, on the Bear and Cow objects, the accuracy using both losses is better than with either individual one (0.17, 0.13 vs 0.11 for Bear and 0.10, 0.07 vs 0.04 for Cow), showing the ability to combine the best of both training signals.

5.2 Real data

In this section we report our results on the DiLiGenT-MV benchmark. We first show that there are some issues with the DiLiGenT-MV intrinsic camera parameters and/or ground truth normals. In particular, we compute the shape estimation error using the original intrinsic parameters and obtain a significantly higher error of estimated shape than on the synthetic DiLiGenT-MV benchmark. However, if we take the intrinsic parameters from the Reading object and use them for all objects, the error is drastically reduced (0.11 to 0.03). We note that for the Reading object, the difference of 0.06 to 0.05 is explained by adjusting the rotation matrices with SVD.

When looking at our network performance, we achieve results close to the state of the art, matching the original DiLiGenT-MV method [21] (0.20 vs 0.22mm average error) and significantly outperforming the deep learning based competitors PS-NeRF [40] and MVAS [2]. Note that in the original PS-NeRF paper the performance was compared by including the bottom (invisible) part of the object, which gave the misleading result of the deep learning method outperforming the classical one. We believe that the classical method performs so well because the objects have relatively simple geometry (especially the Cow, which is mostly convex and has the least amount of shadows and self reflection). The reconstructed meshes are shown in Figure 5.

Note that we also show that if we reduce the number of lights available (e.g. to 6), the quality of the normals decreases and the impact of rendering increases, justifying our key contribution. If a more challenging multi-view photometric stereo benchmark were available, we expect that our approach would display a larger gain.

Method	Bear	Buddha	Cow	Pot2	Reading	Average SE
GT normals	0.13	0.15	0.10	0.12	0.06	0.11
GT normals (corr. params)	0.02	0.06	0.01	0.02	0.05	0.03
DiLiGenT-MV [21]	0.22	0.33	0.08	0.21	0.25	0.22
PS-NeRF [40]	0.27	0.33	0.27	0.26	0.36	0.30
MVAS [2]	0.25	0.37	0.21	0.20	0.52	0.31
SuperNormal [1]	0.21	0.21	0.19	0.15	0.21	0.19
Ours - [normals only] - 6 lights	0.59	0.46	0.34	0.63	0.45	0.49
Ours - [intensities only] - 6 lights	0.22	0.29	0.22	0.32	0.32	0.27
Ours - [normals + intensities] - 6 lights	0.25	0.38	0.24	0.52	0.33	0.34
Ours - [normals only]	0.29	0.19	0.17	0.19	0.29	0.22
Ours - [intensities only]	0.25	0.24	0.18	0.34	0.26	0.26
Ours - [normals + intensities]	0.21	0.19	0.17	0.20	0.22	0.20

Table 2: Results on the DiLiGenT-MV [21] benchmark. For each object we report the mean shape error (in mm), together with the average over all objects. Normals are computed using the universal PS method of [9].

Finally, we achieve slightly inferior performance to SuperNormal [1], mainly because of the Pot2 object (0.2 vs 0.15mm). However, given the discovered issues with the DiLiGenT-MV benchmark, and with the code (or the normal maps used as input) not being available for this method, we refrain from drawing conclusions about the reasons why it slightly outperforms our method. One possible explanation is that there are more locally corrupted images (like the specular highlights on Bear) that interfere with our renderer but do not pose an issue to their particular normal estimation network. Note that the SuperNormal method itself is functionally very similar to ours when no intensities are used (i.e. SuperNormal makes use of an SDF trained with normal, silhouette and Eikonal losses).

6 Conclusions

In this work we proposed a novel multi-view photometric stereo method. Unlike most MVPS methods, our approach explicitly leverages per-pixel intensity renderings rather than relying mainly on estimated normals. We believe such an approach is required for truly applicable and robust MVPS, as the estimated normals are likely to fail on complex materials or geometries. We demonstrate the benefit of leveraging intensities on the synthetic and real DiLiGenT-MV benchmarks and the applicability of our method to the minimal 6 lights case.

References

[1] Cao, X., Taketomi, T.: Supernormal: Neural surface reconstruction via multi-view normal integration. CVPR (2024)
[2] Cao, X., Santo, H., Okura, F., Matsushita, Y.: Multi-view azimuth stereo via tangent space consistency. In: CVPR (2023)
[3] Chen, G., Han, K., Shi, B., Matsushita, Y., Wong, K.Y.K.: Sdps-net: Self-calibrating deep photometric stereo networks. In: CVPR (2019)
[4] Du, H., Goldman, D.B., Seitz, S.M.: Binocular photometric stereo. In: BMVC (2011)
[5] Esteban, C.H., Vogiatzis, G., Cipolla, R.: Multiview photometric stereo. PAMI (2008)
[6] Getreuer, P.: Malvar-he-cutler linear image demosaicking. Image Processing on Line 1, 83–89 (2011)
[7] Goldman, D.B., Curless, B., Hertzmann, A., Seitz, S.M.: Shape and spatially-varying brdfs from photometric stereo. PAMI (2010)
[8] Guo, H., Santo, H., Shi, B., Matsushita, Y.: Edge-preserving near-light photometric stereo with neural surfaces. arXiv (2022)
[9] Hardy, C., Quéau, Y., Tschumperlé, D.: Uni MS-PS: a Multi-Scale Encoder Decoder Transformer for Universal Photometric Stereo (Feb 2024), https://hal.science/hal-04431103, working paper or preprint
[10] Hui, Z., Sankaranarayanan, A.C.: Shape and spatially-varying reflectance estimation from virtual exemplars. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI) 39(10), 2060–2073 (2017)
[11] Ikehata, S.: Cnn-ps: Cnn-based photometric stereo for general non-convex surfaces. In: ECCV (2018)
[12] Ikehata, S.: Universal photometric stereo network using global lighting contexts. CVPR (2022)
[13] Ikehata, S.: Scalable, detailed and mask-free universal photometric stereo. CVPR (2023)
[14] Ju, Y., Zhang, C., Huang, S., Rao, Y., Lam, K.M.: Learning deep photometric stereo network with reflectance priors. In: ICME (2023)
[15] Kaya, B., Kumar, S., Oliveira, C., Ferrari, V., Van Gool, L.: Uncertainty-aware deep multi-view photometric stereo. In: CVPR (2022)
[16] Kaya, B., Kumar, S., Oliveira, C., Ferrari, V., Van Gool, L.: Multi-view photometric stereo revisited. In: WACV (2023)
[17] Kaya, B., Kumar, S., Sarno, F., Ferrari, V., Van Gool, L.: Neural radiance fields approach to deep multi-view photometric stereo. In: WACV (2021). https://doi.org/10.48550/ARXIV.2110.05594, https://arxiv.org/abs/2110.05594
[18] Kong, H., Xu, P., Teoh, E.K.: Binocular uncalibrated photometric stereo. In: ISCV (2006)
[19] Li, J., Li, H.: Neural reflectance for shape recovery with shadow handling. In: CVPR. pp. 16221–16230 (2022)
[20] Li, J., Li, H.: Self-calibrating photometric stereo by neural inverse rendering. In: ECCV. Springer (2022)
[21] Li, M., Zhou, Z., Wu, Z., Shi, B., Diao, C., Tan, P.: Multi-view photometric stereo: A robust solution and benchmark dataset for spatially varying isotropic materials. IEEE Trans. Image Process. (2020)
[22] Li, Z., Müller, T., Evans, A., Taylor, R.H., Unberath, M., Liu, M.Y., Lin, C.H.: Neuralangelo: High-fidelity neural surface reconstruction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
[23] Liu, Y., Wang, P., Lin, C., Long, X., Wang, J., Liu, L., Komura, T., Wang, W.: Nero: Neural geometry and brdf reconstruction of reflective objects from multiview images. In: SIGGRAPH (2023)
[24] Logothetis, F., Budvytis, I., Cipolla, R.: A neural height-map approach for the binocular photometric stereo problem. In: WACV (2023)
[25] Logothetis, F., Budvytis, I., Cipolla, R.: A neural height-map approach for the binocular photometric stereo problem. WACV (2024)
[26] Logothetis, F., Budvytis, I., Mecca, R., Cipolla, R.: PX-NET: Simple, Efficient Pixel-Wise Training of Photometric Stereo Networks. In: ICCV (2021)
[27] Logothetis, F., Mecca, R., Budvytis, I., Cipolla, R.: A cnn based approach for the point-light photometric stereo problem. IJCV (2022)
[28] Logothetis, F., Mecca, R., Cipolla, R.: A differential volumetric approach to multi-view photometric stereo. In: ICCV (2019)
[29] Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface construction algorithm. In: Seminal graphics: pioneering efforts that shaped the field, pp. 347–353 (1998)
[30] Matusik, W., Pfister, H., Brand, M., McMillan, L.: A data-driven reflectance model. ACM TOG (2003)
[31] Mecca, R., Wetzler, A., Bruckstein, A., Kimmel, R.: Near Field Photometric Stereo with Point Light Sources. SIAM Journal on Imaging Sciences (2014)
[32] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
[33] Nießner, M., Zollhöfer, M., Izadi, S., Stamminger, M.: Real-time 3d reconstruction at scale using voxel hashing. ACM Trans. Graph. (2013)
[34] Park, J., Sinha, S.N., Matsushita, Y., Tai, Y.W., Kweon, I.S.: Robust multiview photometric stereo using planar mesh parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(8), 1591–1604 (2017)
[35] Quéau, Y., Durou, J.D.: Edge-preserving integration of a normal field: Weighted least squares, TV and L1 approaches. In: SSVM (2015)
[36] Sitzmann, V., Martel, J.N., Bergman, A.W., Lindell, D.B., Wetzstein, G.: Implicit neural representations with periodic activation functions. In: NeurIPS (2020)
[37] Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron, J.T., Srinivasan, P.P.: Ref-nerf: Structured view-dependent appearance for neural radiance fields. CVPR (2022). https://doi.org/10.48550/ARXIV.2112.03907, https://arxiv.org/abs/2112.03907
[38] Wang, C., Wang, L., Matsushita, Y., Huang, B., Chen, M., Soong, F.K.: Binocular photometric stereo acquisition and reconstruction for 3d talking head applications. In: INTERSPEECH (2013)
[39] Woodham, R.J.: Determining surface curvature with photometric stereo. In: ICRA (1989)
[40] Yang, W., Chen, G., Chen, C., Chen, Z., Wong, K.Y.K.: Ps-nerf: Neural inverse rendering for multi-view photometric stereo. In: ECCV (2022)
[41] Yang, W., Chen, G., Chen, C., Chen, Z., Wong, K.Y.K.: S3-NeRF: Neural reflectance field from shading and shadow under a single viewpoint. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) NeurIPS (2022), https://openreview.net/forum?id=tvwkeAIcRP8
[42] Yariv, L., Gu, J., Kasten, Y., Lipman, Y.: Volume rendering of neural implicit surfaces. In: Thirty-Fifth Conference on Neural Information Processing Systems (2021)
[43] Yariv, L., Kasten, Y., Moran, D., Galun, M., Atzmon, M., Ronen, B., Lipman, Y.: Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems 33, 2492–2502 (2020)
[44] Zhang, J., Yao, Y., Li, S., Liu, J., Fang, T., McKinnon, D., Tsin, Y., Quan, L.: Neilf++: Inter-reflectable light fields for geometry and material estimation. In: International Conference on Computer Vision (ICCV) (2023)
[45] Zhao, D., Lichy, D., Perrin, P.N., Frahm, J.M., Sengupta, S.: Mvpsnet: Fast generalizable multi-view photometric stereo. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12525–12536 (October 2023)
[46] Zhou, Z., Wu, Z., Tan, P.: Multi-view photometric stereo with spatially varying isotropic materials. In: IEEE Conference on Computer Vision and Pattern Recognition (2013)
[47] Zollhöfer, M., Dai, A., Innmann, M., Wu, C., Stamminger, M., Theobalt, C., Nießner, M.: Shading-based refinement on volumetric signed distance functions. ACM Trans. Graph. (2015)

Appendix 0.A Appendix

This document contains supplementary material. It is organised as follows: DiLiGenT-MV data anomalies in Section 0.A.1 and visualisation of shadows and indirect reflectance on synthetic data in Section 0.A.2.

0.A.1 DiLiGenT-MV data anomalies

This section gives additional information about identified data anomalies in DiLiGenT-MV data. First of all, we report that the rotation matrices for the Reading object do not have determinant 1 (as valid rotation matrices should). For example, for view 1 this is shown in Equation 3:

	$\displaystyle R_{1}=\begin{bmatrix}0.0238&1.0031&-0.0137\\ 0.4530&-0.0230&-0.8912\\ -0.8912&0.0150&-0.4533\\ \end{bmatrix}\text{ with }\det(R_{1})=1.0035$		(3)
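
The anomaly in Equation 3 can be checked numerically. The following is a minimal sketch, assuming only NumPy and the four-decimal entries printed above (so the computed determinant can differ from the reported 1.0035 in the last digit); the variable names are ours and are not part of the DiLiGenT-MV tooling.

```python
import numpy as np

# View-1 rotation of the Reading object, copied from Equation 3
# (entries are rounded to four decimals, as printed).
R1 = np.array([
    [0.0238, 1.0031, -0.0137],
    [0.4530, -0.0230, -0.8912],
    [-0.8912, 0.0150, -0.4533],
])

# A valid rotation has determinant +1 and orthonormal rows.
det_R1 = np.linalg.det(R1)
row_norms = np.linalg.norm(R1, axis=1)
orth_err = np.linalg.norm(R1 @ R1.T - np.eye(3))  # Frobenius norm

print(f"det(R1)         = {det_R1:.4f}")
print(f"row norms       = {np.round(row_norms, 4)}")
print(f"||R R^T - I||_F = {orth_err:.4f}")
```

With these rounded entries, the excess over 1 sits almost entirely in the norm of the first row, which is consistent with a scaling issue rather than a gross error in the pose.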

In addition, we note that the intrinsic matrix of the Reading object differs from that of the remaining objects, the difference being in the x-axis focal length as well as the principal point, as shown in Equations 4 and 5:

	$\displaystyle K_{Reading}$	$\displaystyle=\begin{bmatrix}3759.1&0&305.5\\ 0&3759&255.5\\ 0&0&1\\ \end{bmatrix}$		(4)
	$\displaystyle K_{rest}$	$\displaystyle=\begin{bmatrix}3772.1&0&305.875\\ 0&3759&255.875\\ 0&0&1\\ \end{bmatrix}$		(5)

As we show in Table 2 of the main submission, the best performance and compatibility with the supplied GT normal maps were achieved by using the Reading intrinsics for all objects and by fixing the scaling of the rotation matrices with SVD.
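
One common way to fix such a matrix with SVD is to project it onto the nearest rotation in the Frobenius norm (the orthogonal Procrustes solution). The sketch below illustrates this standard technique under that assumption. It is not necessarily the exact procedure used in our experiments, and the helper name `nearest_rotation` is hypothetical.

```python
import numpy as np

def nearest_rotation(M):
    """Closest rotation matrix to M in the Frobenius norm.

    Computes M = U S V^T and replaces the singular values S by ones,
    which removes any scaling. The last direction is flipped if needed
    so that the result has determinant +1 (not a reflection).
    """
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

# View-1 rotation of the Reading object as printed in Equation 3.
R1 = np.array([
    [0.0238, 1.0031, -0.0137],
    [0.4530, -0.0230, -0.8912],
    [-0.8912, 0.0150, -0.4533],
])

R1_fixed = nearest_rotation(R1)
print("det before:", np.linalg.det(R1))
print("det after: ", np.linalg.det(R1_fixed))
print("change:    ", np.linalg.norm(R1_fixed - R1))
```

The projection changes the matrix only slightly, since the original is already close to orthonormal, but the result is a valid rotation up to floating-point precision.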

In addition, we note that various images of the Bear object appear to be corrupted, as shown in Figure 6. This has been a known issue for the first view (first noted by Ikehata in [11]), but we also found corrupted images in other views. As most of the images are very dark, this is not easy to notice unless the brightness is adjusted.

Finally, we note that the background seems to occlude part of the bottom of some objects, as shown in Figure 7. This justifies our choice of removing the bottom 6mm of all objects for all methods when evaluating reconstruction accuracy.
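
The cropping step can be sketched as follows. This is an illustrative example only, and several details are assumptions: that point coordinates are in millimetres, that the vertical axis is z, that the base height is taken from the ground truth, and that the Chamfer distance is the mean of the two directional average nearest-neighbour distances. The function names are hypothetical, and the brute-force distance matrix is only suitable for small point sets.

```python
import numpy as np

def crop_bottom(points, base, height_mm=6.0, up_axis=2):
    """Keep only points at least height_mm above base along up_axis."""
    return points[points[:, up_axis] >= base + height_mm]

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N,3) and b (M,3)."""
    # Pairwise Euclidean distances, shape (N, M); O(N*M) memory.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# Toy data: a ground-truth cloud and a prediction shifted by 0.5mm in z.
gt = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 10.0], [1.0, 0.0, 10.0]])
pred = gt + np.array([0.0, 0.0, 0.5])

# Use the same base height for both clouds so the crop is consistent.
base = gt[:, 2].min()
cd = chamfer_distance(crop_bottom(pred, base), crop_bottom(gt, base))
print(f"Chamfer distance after cropping: {cd:.3f} mm")
```

Taking the base height from a single reference cloud applies the same cut to every method, so errors near the occluded bottom do not favour any of them.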

0.A.2 DiLiGenT-MV-Synthetic visualisation

This section contains additional visualisations of the relationship between shadows and indirect reflectance, obtained by investigating the synthetic version of DiLiGenT-MV. Figure 8 shows that indirect reflectance is higher in regions with a high number of shadows (i.e. high ambient occlusion $\tilde{s}$).