GenS: Generalizable Neural Surface Reconstruction from Multi-View Images
Abstract
Combining the signed distance function (SDF) and differentiable volume rendering has emerged as a powerful paradigm for surface reconstruction from multi-view images without 3D supervision. However, current methods require lengthy per-scene optimization and cannot generalize to new scenes. In this paper, we present GenS, an end-to-end generalizable neural surface reconstruction model. Unlike coordinate-based methods that train a separate network for each scene, we construct a generalized multi-scale volume to directly encode all scenes. Compared with existing solutions, our representation is more powerful: it can recover high-frequency details while maintaining global smoothness. Meanwhile, we introduce a multi-scale feature-metric consistency to impose multi-view consistency in a more discriminative multi-scale feature space, which is robust to the failures of photometric consistency. The learnable features can also be self-enhanced to continuously improve the matching accuracy and mitigate aggregation ambiguity. Furthermore, we design a view contrast loss that forces the model to be robust to regions covered by few viewpoints by distilling the geometric prior from dense inputs to sparse inputs. Extensive experiments on popular benchmarks show that our model generalizes well to new scenes and outperforms existing state-of-the-art methods, even those that employ ground-truth depth supervision. Code is available at https://github.com/prstrive/GenS.
1 Introduction
Surface reconstruction from multi-view images is a cornerstone task in computer vision with many applications in virtual reality, autonomous driving, robotics, etc. Typical solutions kazhdan2006poisson ; kazhdan2013screened ; galliani2015massively ; schonberger2016structure ; yao2018mvsnet ; xu2019multi ; gu2020cascade ; peng2022rethinking in the past were mostly based on a multi-step pipeline, which includes depth estimation, depth fusion and meshing. Although they have demonstrated their excellent performance, the procedure is cumbersome and inevitably introduces cumulative errors. While several early works niemeyer2020differentiable ; yariv2020multiview used differentiable surface rendering to directly reconstruct surfaces, recent works oechsle2021unisurf ; wang2021neus ; yariv2021volume , inspired by the huge success of neural radiance field (NeRF) mildenhall2020nerf in synthesizing novel views, follow the volume rendering max1995optical to represent the 3D geometry with an occupancy field mescheder2019occupancy or signed distance function (SDF) park2019deepsdf and can achieve more impressive results.
The key idea of these approaches is to train a compact multi-layer perceptron (MLP) to predict the implicit representation (e.g., SDF) of each sampled point on camera rays. The density of volume rendering is then regarded as a function of this implicit representation, and alpha-composition of samples is performed to produce the corresponding pixel color and geometry information. However, existing methods are hampered by a lengthy per-scene optimization procedure and cannot generalize to new scenes, which makes them infeasible for many application scenarios. A recent method long2022sparseneus attempts to address these issues by conditioning the SDF-induced model on features extracted from sparse nearby views. Nevertheless, its accuracy is limited by overly smooth reconstructions, and the multi-stage process it relies on is prone to introducing cumulative errors. In this paper, we seek to establish an end-to-end generalizable model that can efficiently infer finer 3D structure. Compared with existing per-scene methods wang2021neus ; yariv2021volume , such a generalizable system faces more challenging problems. First, it is non-trivial to represent the scene efficiently. Previous methods murez2020atlas ; long2022sparseneus ; chen2021mvsnerf ; wang2021ibrnet either build a global volume or employ feature projections, but these have proven to be either lacking in detail or unsuitable for view-independent surface reconstruction. Second, it is difficult to reconstruct compact geometry by relying only on the rendering loss, since multi-view consistency is ignored. We also found that ordinary photometric consistency cannot effectively solve this problem for our generalizable model because of ambiguous areas such as low-texture and reflective regions. Last but not least, since generalizable models rely heavily on aggregation quality, inferring smooth geometry from sparse inputs is a thorny issue.
To this end, we introduce GenS to tackle these challenges. The main ideas behind it are as follows: 1) We first construct a generalized multi-scale volume to represent the scene, which preserves global smoothness through the low-resolution volumes and recovers geometric details from the high-resolution volumes. Meanwhile, low-dimensional features make our model more lightweight than a single large-width volume. 2) We introduce the multi-scale feature-metric consistency, which enforces multi-view consistency in the multi-scale feature space, to replace the common photometric consistency. Compared with the original image space, learnable multi-scale features provide a more discriminative representation, and the feature space can be self-enhanced during generalization training to continuously improve the matching accuracy. 3) Inspired by the fact that reconstruction from dense inputs is more accurate, we propose a view contrast loss that forces the model to better perceive the geometry of regions visible from few viewpoints by using the reconstruction from dense inputs to teach the reconstruction from sparse inputs.
To demonstrate the quantitative and qualitative effectiveness of GenS, we conduct extensive experiments on the DTU jensen2014large and BlendedMVS yao2020blendedmvs datasets. Results show that our model outperforms the existing state-of-the-art generalizable method long2022sparseneus , and even the recent method ren2022volrecon , which adopts ground-truth depth for supervision. Compared with the per-scene overfitting methods wang2021neus ; yariv2021volume ; yariv2020multiview ; oechsle2021unisurf ; long2022sparseneus , we also achieve comparable or superior results with dense inputs. Some comparisons are shown in Fig. 1. In summary, our main contributions are as follows: a) We present a powerful representation based on our generalized multi-scale volume, which can efficiently reconstruct smooth and detailed surfaces from multi-view images. b) We introduce a more discriminative multi-scale feature-metric consistency to constrain the geometry, which helps the generalization model converge to the optimum. c) We propose a view contrast loss to improve the geometric smoothness and accuracy when the number of visible viewpoints is limited. d) Our model can be trained end-to-end and achieves state-of-the-art reconstructions in both the generic setting and the per-scene optimization setting.
2 Related work
Classical multi-view reconstruction.
Reconstructing 3D geometry from multi-view images is a longstanding problem in the field of 3D vision. Classical algorithms mainly adopt a depth-based or voxel-based methodology to solve this problem. Multi-view stereo (MVS) is a typical class of depth-based methods, which takes stereo correspondence across multiple images as the main cue to reconstruct depth maps. While traditional MVS methods barnes2009patchmatch ; schonberger2016structure ; galliani2015massively ; furukawa2009accurate ; xu2019multi ; schonberger2016pixelwise relied on hand-crafted similarity metrics, many recent learning-based methods yao2018mvsnet ; gu2020cascade ; wang2021patchmatchnet ; peng2022rethinking apply deep learning to achieve more discriminative matching. These methods go through complicated procedures to retrieve the surface, including depth filtering, fusion and meshing kazhdan2006poisson ; bernardini1999ball , and are prone to cumulative errors. On the other hand, voxel-based methods seitz1999photorealistic ; kutulakos2000theory ; izadi2011kinectfusion ; niessner2013real directly model objects in a volume, but they are restricted by memory, a common drawback of volumetric representations, and cannot achieve high accuracy.
Neural surface.
Owing to their ability to achieve high spatial resolution, neural implicit functions have recently gained a lot of attention and have emerged as an effective representation of 3D geometry takikawa2021neural ; mescheder2019occupancy ; park2019deepsdf ; genova2019learning ; michalkiewicz2019implicit ; niemeyer2019occupancy ; peng2020convolutional ; saito2019pifu and appearance liu2020neural ; mildenhall2020nerf ; liu2020dist ; oechsle2019texture ; sun2022direct ; yu2021plenoctrees ; pumarola2021d ; muller2022instant . Furthermore, some works mildenhall2020nerf ; liu2019learning ; niemeyer2020differentiable have proposed to train models without 3D supervision via differentiable rendering, e.g., surface rendering and volume rendering. Methods that adopt surface rendering niemeyer2020differentiable ; yariv2020multiview ; zhang2021learning only consider a single surface intersection point for each ray and fail to reconstruct complex objects; they are also restricted by the need for accurate object masks and careful weight initialization. In contrast, recent methods use volume rendering oechsle2021unisurf ; wang2021neus ; yariv2021volume ; yu2022monosdf to take multiple points along the ray into consideration and achieve more impressive results. However, both types of methods require expensive per-scene optimization and cannot generalize to new scenes.
Generalizable neural surface.
In the field of novel view synthesis, some methods chen2021mvsnerf ; wang2021ibrnet ; yu2021pixelnerf ; johari2022geonerf have successfully introduced generalization into rendering methods. These methods suffer from the same problem as NeRF: the geometry is ambiguous. Few works have focused on the generalization of neural surface reconstruction. A recent study, SparseNeuS long2022sparseneus , is the first attempt to achieve this by reconstructing the surface from nearby viewpoints in a multi-stage manner. Nevertheless, its reconstruction lacks details, and, as in classical 3D reconstruction, the multi-stage pipeline may accumulate errors at each stage. In contrast, our model can be trained end-to-end and reconstructs smoother and more refined geometries.
3 Method
Given posed images of an object taken from different viewpoints, our goal is to reconstruct the surface as an implicit function, either without expensive per-scene optimization or with only fast fine-tuning. Our overall framework is depicted in Fig. 2. We first introduce how to infer the geometry and appearance from the generalized multi-scale volume in Sec. 3.1, then elaborate on the necessity and implementation of the multi-scale feature-metric consistency in Sec. 3.2, and finally detail the realization of the view contrast loss in Sec. 3.3 and the overall pipeline in Sec. 3.4.
3.1 Geometry and color reasoning from the generalized multi-scale volume
Compared with the existing solution long2022sparseneus , which relies on a single volume and a multi-stage strategy, our approach has three main advantages. First, our generalized multi-scale volume is a more powerful representation, which implicitly decouples geometry into base structures in the low-resolution volumes and high-frequency details in the high-resolution volumes. Second, with low-dimensional features, we can construct multi-scale volumes with higher resolution and less memory consumption than a single large-width volume. Third, our model can be trained end-to-end, avoiding cumulative errors.
Generalized multi-scale volume construction.
Suppose there are $N$ images $\{I_i\}_{i=0}^{N-1}$ of an object. We first apply a feature pyramid network (FPN) lin2017feature with shared weights to extract multi-scale feature maps for all images, and the volumes are then constructed from the features at the corresponding scales. In this paper, we define the bounding box of interest in the reference frustum, as in long2022sparseneus , or in the world coordinate system, as in wang2021neus ; zhang2020nerf++ , when dense inputs are available. We adopt a combination of $L$ volumes $\{V^j\}_{j=0}^{L-1}$, which cover the same region but with different resolutions.
Here, we discuss the first scale and omit the scale index for convenience. Given the camera intrinsics $K_i$ and extrinsics $[R_i, \mathbf{t}_i]$ of each viewpoint $i$, we first project a voxel $\mathbf{v}$ onto the pixel position $\mathbf{p}_i$ of viewpoint $i$:
$$\mathbf{p}_i = \pi\left(K_i (R_i \mathbf{v} + \mathbf{t}_i)\right), \tag{1}$$
where $\pi$ is an operator that converts homogeneous coordinates to Cartesian coordinates. We can then obtain the feature of each voxel in viewpoint $i$, $F_i(\mathbf{p}_i)$, through bilinear interpolation. To fuse the features from all viewpoints, we adopt the same aggregation strategy as wang2021ibrnet to generate the cost volume, concatenating the mean and variance to simultaneously capture statistical and semantic information: $G(\mathbf{v}) = [\mathrm{Mean}(\{F_i(\mathbf{p}_i)\}), \mathrm{Var}(\{F_i(\mathbf{p}_i)\})]$.
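As a rough illustration of the projection in Eq. 1 and the mean/variance aggregation, the following minimal sketch uses plain Python lists. The helper names (`project`, `mean_var`), the toy camera, and the use of population variance are assumptions made for illustration; bilinear feature sampling is omitted, and this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vec = Sequence[float]
Mat = Sequence[Sequence[float]]


def matvec(m: Mat, v: Vec) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def project(K: Mat, R: Mat, t: Vec, v: Vec) -> Tuple[float, float]:
    """Eq. 1: p = pi(K (R v + t)), where pi divides by the last coordinate."""
    cam = [a + b for a, b in zip(matvec(R, v), t)]  # world -> camera frame
    h = matvec(K, cam)                              # camera -> homogeneous pixel
    return h[0] / h[2], h[1] / h[2]


def mean_var(features: Sequence[Vec]) -> List[float]:
    """Concatenate the per-channel mean and (population) variance over views."""
    n = len(features)
    channels = list(zip(*features))
    mean = [sum(c) / n for c in channels]
    var = [sum((x - m) ** 2 for x in c) / n for c, m in zip(channels, mean)]
    return mean + var


# Toy pinhole camera at the world origin (hypothetical values).
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

pixel = project(K, R, t, [0.2, -0.1, 2.0])
# Two views, two feature channels each, already sampled at the voxel's pixels.
cost = mean_var([[1.0, 2.0], [3.0, 2.0]])
```

In a real pipeline the per-view features would come from bilinear sampling of the FPN feature maps at `pixel`, and the aggregation would run for every voxel of every scale.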
By simply repeating the above process on the features and volumes at all scales, we obtain the multi-scale cost volumes. Next, we design an efficient multi-scale 3D network to refine these cost volumes in a single forward pass, starting from the finest volume and injecting the others into different stages of the network to save memory. The output of the 3D network is the multi-scale volume from which we infer the geometry.
Geometry reasoning.
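For an arbitrary 3D point $\mathbf{x} \in \mathbb{R}^3$, we first interpolate the volumes at all $L$ scales, $\{V^j(\mathbf{x})\}_{j=0}^{L-1}$, through trilinear sampling, and then concatenate them into the final feature $\mathbf{z}(\mathbf{x})$, whose channel dimension is the sum of the channel dimensions of all scales. Combining this feature with the point position, an MLP $f$ is applied to predict the corresponding SDF value: $s(\mathbf{x}) = f(\mathbf{x}, \mathbf{z}(\mathbf{x}))$. The surface $\mathcal{S}$ is represented by the zero-level set of the SDF:
$$\mathcal{S} = \{\mathbf{x} \in \mathbb{R}^3 \mid s(\mathbf{x}) = 0\}. \tag{2}$$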
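The multi-scale lookup described above can be sketched as follows: each scale is sampled trilinearly at the same normalized position and the results are concatenated. This is a toy, single-channel version under stated assumptions; the function names, the `grid[z][y][x]` layout, and the normalized-coordinate convention are hypothetical, not taken from the released code.

```python
import math
from typing import List, Sequence

Grid = List[List[List[float]]]  # one feature channel, indexed grid[z][y][x]


def trilinear(grid: Grid, x: float, y: float, z: float) -> float:
    """Sample a grid (at least 2 voxels per axis) at continuous voxel coordinates."""
    dims = (len(grid[0][0]), len(grid[0]), len(grid))  # (nx, ny, nz)
    base, frac = [], []
    for coord, size in zip((x, y, z), dims):
        coord = min(max(coord, 0.0), size - 1.0)    # clamp to the grid
        i0 = min(int(math.floor(coord)), size - 2)  # keep i0 + 1 in range
        base.append(i0)
        frac.append(coord - i0)
    (x0, y0, z0), (fx, fy, fz) = base, frac
    value = 0.0
    # Weighted sum over the 8 corners of the enclosing cell.
    for dz, wz in ((0, 1.0 - fz), (1, fz)):
        for dy, wy in ((0, 1.0 - fy), (1, fy)):
            for dx, wx in ((0, 1.0 - fx), (1, fx)):
                value += wx * wy * wz * grid[z0 + dz][y0 + dy][x0 + dx]
    return value


def multiscale_feature(grids: Sequence[Grid], u: float, v: float, w: float) -> List[float]:
    """Sample every scale at normalized coordinates in [0, 1]^3 and concatenate."""
    feature = []
    for g in grids:
        nx, ny, nz = len(g[0][0]), len(g[0]), len(g)
        feature.append(trilinear(g, u * (nx - 1), v * (ny - 1), w * (nz - 1)))
    return feature


def linear_grid(n: int) -> Grid:
    """Toy volume whose value at voxel (x, y, z) is x + 2y + 4z."""
    return [[[x + 2.0 * y + 4.0 * z for x in range(n)] for y in range(n)] for z in range(n)]


# A coarse (2^3) and a finer (3^3) volume covering the same region.
feature = multiscale_feature([linear_grid(2), linear_grid(3)], 0.5, 0.5, 0.5)
```

The concatenated `feature`, together with the point position, would then be fed to the SDF MLP; with several channels per scale the same lookup is simply repeated per channel.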
Color prediction.
We refer to the first viewpoint as the reference image. To predict the color of each point on a ray, we employ a blending strategy similar to wang2021ibrnet . We first project the 3D point to the pixel positions of the source views according to Eq. 1, and interpolate the corresponding colors $\{\mathbf{c}_i\}$ and features $\{\mathbf{f}_i\}$. Here, we only use the highest-resolution features to predict the color. Next, an MLP takes the concatenation of the features and the viewing-direction differences as input to predict the softmax-activated blending weight $w_i$ of each source view, and the final color is the weighted sum of the source colors:
$$\hat{\mathbf{c}} = \sum_{i} w_i \mathbf{c}_i, \qquad \sum_{i} w_i = 1. \tag{3}$$
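The blending step in Eq. 3 amounts to a softmax over per-view scores followed by a weighted sum of colors. The sketch below assumes the scores have already been produced by some network; `blend_color` and the example values are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the per-view scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_color(source_colors: Sequence[Sequence[float]], scores: Sequence[float]) -> List[float]:
    """Eq. 3: weighted sum of the source-view RGB colors."""
    weights = softmax(scores)
    return [sum(w * c[ch] for w, c in zip(weights, source_colors)) for ch in range(3)]


# Two source views with equal scores: the blended color is their average.
blended = blend_color([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0])
weights = softmax([2.0, 0.5, -1.0])
```

Because the weights sum to one, the blended color always stays within the convex hull of the observed source colors.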
SDF-based volume rendering.
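Given the density $\sigma_k$ and color $\mathbf{c}_k$ of the $M$ samples along the ray $\mathbf{r}$ emitted from the camera center $\mathbf{o}$ through a pixel in view direction $\mathbf{d}$, NeRF mildenhall2020nerf approximates the pixel color using numerical quadrature:
$$\hat{C}(\mathbf{r}) = \sum_{k=1}^{M} T_k \alpha_k \mathbf{c}_k, \tag{4}$$
where $T_k = \prod_{l=1}^{k-1} (1 - \alpha_l)$ is the accumulated transmittance, and $\alpha_k = 1 - \exp(-\sigma_k \delta_k)$ in the original volume rendering, with $\delta_k$ the distance between adjacent samples. To better approximate the geometry of the scene, NeuS wang2021neus proposed an unbiased and occlusion-aware weighting method that incorporates the signed distance, in which $\alpha_k$ is formulated as:
$$\alpha_k = \max\left(\frac{\Phi_\tau(s(\mathbf{x}_k)) - \Phi_\tau(s(\mathbf{x}_{k+1}))}{\Phi_\tau(s(\mathbf{x}_k))}, 0\right). \tag{5}$$
Here, $\Phi_\tau(x) = (1 + e^{-\tau x})^{-1}$ is the sigmoid function and $\tau$ is an anneal factor. Readers can refer to wang2021neus for more details.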
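To make the SDF-based weighting concrete, the sketch below computes the NeuS-style opacity of each interval from consecutive SDF samples and then the usual compositing weights (transmittance times opacity). The sharpness value `tau` and the sample values are arbitrary assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(x: float, tau: float) -> float:
    """Logistic function with sharpness tau."""
    return 1.0 / (1.0 + math.exp(-tau * x))


def neus_alphas(sdf: Sequence[float], tau: float) -> List[float]:
    """Opacity of each interval between consecutive SDF samples along a ray."""
    alphas = []
    for s0, s1 in zip(sdf[:-1], sdf[1:]):
        p0, p1 = sigmoid(s0, tau), sigmoid(s1, tau)
        alphas.append(max((p0 - p1) / p0, 0.0))
    return alphas


def render_weights(alphas: Sequence[float]) -> List[float]:
    """Compositing weights T_k * alpha_k with accumulated transmittance T_k."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(transmittance * a)
        transmittance *= 1.0 - a
    return weights


# SDF samples along a ray that crosses the surface at the fourth sample.
sdf_samples = [0.3, 0.2, 0.1, 0.0, -0.1, -0.2]
weights = render_weights(neus_alphas(sdf_samples, tau=64.0))
```

For a ray that enters the surface once, the weights concentrate on the intervals adjacent to the zero crossing and sum to at most one; the rendered color is the weight-averaged sample color.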
3.2 Multi-scale feature-metric consistency
The rendering loss tends to trap the model in sub-optimal solutions, since it only considers a single point and ignores the consistency among multiple viewpoints. To mitigate this problem, a straightforward practice is to project the image patches of multiple views to the estimated surface location based on the local planar assumption and rely on photometric consistency to enforce multi-view consistency. However, we found that this solution works well for per-scene overfitting darmon2022improving ; geoneus but brings limited benefits to generalization training.
We attribute this mainly to the failure of photometric consistency, which becomes more challenging for generalization training. As shown by recent self-supervised multi-view stereo methods xu2021digging ; xu2021self ; yang2021self ; qiu2022self , the assumption of photometric consistency is not always valid, and the predicted geometry still has significant holes even in combination with a robust patch similarity such as SSIM wang2004image . Because coordinate-based methods train a separate model for each scene and directly overfit it, they have greater potential to converge to the optimum. In contrast, our generalization model encodes all scenes with one model and requires image features to infer geometry, which makes it rely heavily on the discriminability of those features; regions such as low-texture and reflective areas therefore degrade the results more severely. As shown in Fig. 3 (a), regions that violate photometric consistency not only reduce the accuracy of multi-view matching but also decrease the discriminability of the generalization model's input (we call this aggregation ambiguity), whereas the inputs of overfitting methods (3D coordinates) are distinct.
To overcome these challenges, we propose the multi-scale feature-metric consistency to measure the consistency between views in a multi-scale feature space, as shown in Fig. 4. This has three main advantages. First, learnable features have been shown to be more discriminative than the original image johnson2016perceptual , especially in ambiguous regions such as low-texture and reflective areas. Second, owing to the larger receptive field, multi-scale information helps improve the matching accuracy and allows the model to be assisted by global information while recovering details. Third, and more importantly, the feature discriminability can be continuously self-enhanced during generalization training: the multi-scale feature space trains a more powerful model through more accurate matching, and the more powerful model in turn leads to a more discriminative feature space. The enhanced features further mitigate the aforementioned aggregation ambiguity. These advantages are verified in Tab. 3.
To generate the geometry, we adopt the same approximation as geoneus to directly locate the zero-level set of the SDF. As shown in Fig. 5, we first find the interval where a ray intersects the surface by checking whether the signs of the SDF values of two adjacent sampling points differ. To handle occlusion, we only extract the surface within the first such interval. Suppose the two samples of the interval are $\mathbf{x}_1$ and $\mathbf{x}_2$, and their distances to the camera center are $t_1$ and $t_2$, respectively; our goal is to compute the position of the surface point $\mathbf{x}^*$ at distance $t^*$. Here, we rely on the assumption that two adjacent samples are close enough that the nearby surface can be regarded as a local plane. In this way, we obtain two similar triangles:
$$\frac{s(\mathbf{x}_1)}{t^* - t_1} = \frac{-s(\mathbf{x}_2)}{t_2 - t^*}. \tag{6}$$
Therefore, we can approximate the distance from the surface to the camera center as:
$$t^* = \frac{s(\mathbf{x}_1)\, t_2 - s(\mathbf{x}_2)\, t_1}{s(\mathbf{x}_1) - s(\mathbf{x}_2)}. \tag{7}$$
We can thus obtain the coordinates of the surface point, $\mathbf{x}^* = \mathbf{o} + t^* \mathbf{d}$, where $\mathbf{o}$ is the camera center and $\mathbf{d}$ is the view direction.
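The surface localization described above (first sign change, then linear interpolation of the ray distance) can be sketched in a few lines. The function names and the sample values are hypothetical; samples with an SDF of exactly zero are not treated specially here.

```python
from typing import Optional, Sequence, Tuple


def first_surface_distance(t: Sequence[float], sdf: Sequence[float]) -> Optional[float]:
    """Distance along the ray to the first SDF sign change, or None if there is none."""
    for k in range(len(t) - 1):
        s1, s2 = sdf[k], sdf[k + 1]
        if s1 * s2 < 0.0:  # the two adjacent samples lie on opposite sides
            # Linear interpolation under the local-plane assumption.
            return (s1 * t[k + 1] - s2 * t[k]) / (s1 - s2)
    return None


def point_on_ray(o: Sequence[float], d: Sequence[float], t_star: float) -> Tuple[float, ...]:
    """Surface point o + t* d for camera center o and view direction d."""
    return tuple(oi + t_star * di for oi, di in zip(o, d))


t_star = first_surface_distance([1.0, 1.3, 1.6], [0.2, -0.1, -0.4])
x_star = point_on_ray((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), t_star)
no_hit = first_surface_distance([1.0, 1.3], [0.2, 0.1])
```

Only the first crossing is returned, which mirrors the occlusion handling in the text: later crossings along the same ray are hidden behind the first surface.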
Through automatic differentiation of the SDF network at the surface point $\mathbf{x}^*$, we can obtain the corresponding normal $\mathbf{n}$. Based on the assumption that the local surface centered at $\mathbf{x}^*$ is a plane with normal $\mathbf{n}$, we can find the pixel position in a source view that corresponds to a given pixel in the reference view:
(8) |
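For intuition, a plane-induced homography between two pinhole views can be sketched as below. This uses the textbook form $H = K_s (R + \mathbf{t}\,\mathbf{n}^\top / (\mathbf{n}^\top \mathbf{x}^*)) K_r^{-1}$ under explicit assumptions: the surface point and normal are expressed in reference camera coordinates, and $[R, \mathbf{t}]$ maps reference camera coordinates to source camera coordinates. The paper's exact parameterization and sign conventions may differ, and all names and values here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]
Mat = Sequence[Sequence[float]]


def matvec(m: Mat, v: Vec) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a: Mat, b: Mat) -> List[List[float]]:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def intrinsics_inv(K: Mat) -> List[List[float]]:
    """Closed-form inverse, valid for zero-skew pinhole intrinsics only."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]


def dehomogenize(h: Vec) -> Tuple[float, float]:
    return h[0] / h[2], h[1] / h[2]


def plane_homography(K_r: Mat, K_s: Mat, R: Mat, t: Vec, n: Vec, x_star: Vec) -> List[List[float]]:
    """H = K_s (R + t n^T / (n^T x*)) K_r^{-1}, with n and x* in the reference camera frame."""
    d = sum(a * b for a, b in zip(n, x_star))
    M = [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul(matmul(K_s, M), intrinsics_inv(K_r))


# Toy setup: shared intrinsics, small rotation about the y axis, small translation.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
a = math.radians(5.0)
R = [[math.cos(a), 0.0, math.sin(a)], [0.0, 1.0, 0.0], [-math.sin(a), 0.0, math.cos(a)]]
t = [0.1, 0.0, 0.02]
n = [0.0, 0.0, 1.0]        # plane normal in the reference camera frame
x_star = [0.1, 0.2, 2.0]   # surface point defining the plane z = 2
X = [0.5, -0.3, 2.0]       # another 3D point on the same plane

p_ref = dehomogenize(matvec(K, X))
p_src_direct = dehomogenize(matvec(K, [u + v for u, v in zip(matvec(R, X), t)]))
H = plane_homography(K, K, R, t, n, x_star)
p_src_warped = dehomogenize(matvec(H, [p_ref[0], p_ref[1], 1.0]))
```

Warping every pixel of a reference patch with `H` gives the corresponding source patch, which is exact only as long as the local-plane assumption holds.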
For a pixel patch in the reference view, we can find the corresponding source patch by passing all of its pixels through Eq. 8. Regardless of occlusion, if the estimated surface is accurate, these corresponding patches should be consistent. In this paper, we measure patch consistency in a multi-scale feature space. We only use the features at the top 3 scales, since features at lower scales lose a lot of structural information. Therefore, for a pixel patch in a given view, we obtain the multi-scale patches through bilinear interpolation, and then upsample and concatenate them along the channel dimension as the input to the patch similarity measure. Here, we employ the normalized cross-correlation (NCC) to compute the feature-space consistency between the reference patch feature $\mathbf{F}_r$ and the source patch feature $\mathbf{F}_s$:
$$\mathrm{NCC}(\mathbf{F}_r, \mathbf{F}_s) = \frac{\mathrm{Cov}(\mathbf{F}_r, \mathbf{F}_s)}{\sqrt{\mathrm{Var}(\mathbf{F}_r)\,\mathrm{Var}(\mathbf{F}_s)}}, \tag{9}$$
where $\mathrm{Cov}$ denotes covariance and $\mathrm{Var}$ refers to variance. Following the common solution in the multi-view stereo field galliani2015massively , we compute the final multi-scale feature-space consistency loss as the average of the best NCCs:
(10) |
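A minimal sketch of the patch similarity and the "average of the best NCCs" idea follows. Patches are flattened to plain lists of values; the number of retained scores `k`, the `1 - NCC` loss form, and the stabilizing `eps` are assumptions for illustration rather than the paper's exact definition.

```python
import math
from typing import Sequence


def ncc(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Normalized cross-correlation: Cov(a, b) / sqrt(Var(a) Var(b))."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return cov / math.sqrt(var_a * var_b + eps)  # eps guards flat patches


def best_k_ncc_loss(ref: Sequence[float], sources: Sequence[Sequence[float]], k: int) -> float:
    """Average of (1 - NCC) over the k source patches that match the reference best."""
    scores = sorted((ncc(ref, s) for s in sources), reverse=True)[:k]
    return sum(1.0 - s for s in scores) / len(scores)


ref_patch = [1.0, 2.0, 3.0, 4.0]
matching = [2.0 * v + 1.0 for v in ref_patch]   # affine change: NCC close to 1
inverted = [-v for v in ref_patch]              # anti-correlated: NCC close to -1

loss_best = best_k_ncc_loss(ref_patch, [matching, inverted], k=1)
loss_all = best_k_ncc_loss(ref_patch, [matching, inverted], k=2)
```

Keeping only the best-matching views makes the loss tolerant of source views in which the patch is occluded, since those views produce low scores and are discarded.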
3.3 View contrast loss
For a 3D structure captured from multiple viewpoints, some regions are covered by enough viewpoints, while others are visible from only a few. Compared with the former, the aggregated features of the latter are more likely to be polluted by irrelevant rays, making them less predictable. To solve this problem, we design a view contrast loss to improve the accuracy of the reconstruction when visible views are limited, which enforces the geometric estimation to be the same under different inputs of the same scene.
Empirically, we let the results from dense inputs supervise the results from sparse inputs. Specifically, taking a set of multi-view images as input, we first reconstruct a multi-scale volume as a teacher, which is used to infer the finer SDF values for a set of 3D points. Then we build a student multi-scale volume from sparse input views and estimate the corresponding SDF values. Meanwhile, as shown in Fig. 3 (b), we found that only the sampling points falling on the surface have positive epipolar correspondences, and their aggregated features are more meaningful, while other samples are more random and may obtain the same aggregation even if their SDF values are different. As shown in Fig. 6, we thus only calculate the consistency loss for near-surface points, whose finer SDF values are more accurate:
(11) |
where is a set of points close to the surface inferred from the fine SDF according to Eq. 7.
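The teacher/student idea can be sketched as a masked distance between two sets of SDF predictions. Several choices below are assumptions made only for illustration: near-surface points are selected with a simple band threshold on the teacher SDF (the paper selects them via the surface points of Eq. 7), the distance is L1, and the teacher values are treated as fixed targets.

```python
from typing import Sequence


def view_contrast_loss(
    teacher_sdf: Sequence[float],
    student_sdf: Sequence[float],
    band: float,
) -> float:
    """Mean |teacher - student| over points whose teacher SDF lies within the band."""
    pairs = [(t, s) for t, s in zip(teacher_sdf, student_sdf) if abs(t) < band]
    if not pairs:
        return 0.0  # no near-surface points in this batch
    return sum(abs(t - s) for t, s in pairs) / len(pairs)


# SDF values at the same 3D points, predicted from dense (teacher) and sparse (student) inputs.
teacher = [0.01, -0.02, 0.50]
student = [0.03, -0.02, 0.00]

loss = view_contrast_loss(teacher, student, band=0.05)
```

In the example, the third point is far from the surface according to the teacher, so its large disagreement does not contribute to the loss.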
3.4 Overall pipeline
This section will introduce some implementation details and crucial components of our model including generalization training and fine-tuning.
Loss function.
The overall loss function is defined as:
(12) |
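For a batch of sampled pixels $P$, the color loss is computed as the L1 distance between the rendered color $\hat{C}(p)$ and the ground-truth color $C(p)$:
$$\mathcal{L}_{color} = \frac{1}{|P|} \sum_{p \in P} \left\| \hat{C}(p) - C(p) \right\|_1. \tag{13}$$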
To make the geometry more compact and accurate, we apply the regularization loss which is composed of four terms:
(14) |
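The Eikonal loss gropp2020implicit is employed to regularize the SDF values of all sampled points $X$:
$$\mathcal{L}_{eik} = \frac{1}{|X|} \sum_{\mathbf{x} \in X} \left( \left\| \nabla s(\mathbf{x}) \right\|_2 - 1 \right)^2. \tag{15}$$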
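The Eikonal term penalizes deviations of the SDF gradient norm from one. The sketch below checks this on an analytic sphere SDF and uses central finite differences in place of automatic differentiation; both choices are assumptions for illustration.

```python
import math
from typing import Callable, Sequence

SDF = Callable[[Sequence[float]], float]


def sphere_sdf(p: Sequence[float], radius: float = 1.0) -> float:
    """Exact signed distance to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def grad_norm(sdf: SDF, p: Sequence[float], h: float = 1e-4) -> float:
    """Norm of the SDF gradient at p, estimated by central differences."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grad))


def eikonal_loss(sdf: SDF, points: Sequence[Sequence[float]]) -> float:
    """Mean squared deviation of the gradient norm from 1."""
    return sum((grad_norm(sdf, p) - 1.0) ** 2 for p in points) / len(points)


points = [(0.5, 0.2, -0.1), (1.5, -0.3, 0.7), (-0.8, 0.9, 0.4)]
loss_true_sdf = eikonal_loss(sphere_sdf, points)
# A scaled field has gradient norm 2 everywhere, so it violates the constraint.
loss_scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), points)
```

A true distance field yields a loss near zero, while the scaled field is penalized by roughly one, which is the behavior the regularizer relies on.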
To maintain the smoothness of the surface, we introduce a regularization on the gradient of the normal:
(16) |
where is the alpha composition of normal gradient in a ray through pixel . Besides, we also adopt the total variation (TV) regularization rudin1994total for our multi-scale volumes:
(17) |
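As an illustration of total variation on a volume, the sketch below averages absolute differences between neighboring voxels along each axis. The anisotropic L1 form, the normalization, and the single-channel `grid[z][y][x]` layout are assumptions; the paper's exact formulation may differ.

```python
from typing import List

Grid = List[List[List[float]]]  # one feature channel, indexed grid[z][y][x]


def tv_loss(grid: Grid) -> float:
    """Mean absolute difference between axis-aligned neighboring voxels."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    total, count = 0.0, 0
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                v = grid[z][y][x]
                if x + 1 < nx:
                    total += abs(grid[z][y][x + 1] - v)
                    count += 1
                if y + 1 < ny:
                    total += abs(grid[z][y + 1][x] - v)
                    count += 1
                if z + 1 < nz:
                    total += abs(grid[z + 1][y][x] - v)
                    count += 1
    return total / count if count else 0.0


constant = [[[3.0 for _ in range(2)] for _ in range(2)] for _ in range(2)]
ramp_x = [[[float(x) for x in range(2)] for _ in range(2)] for _ in range(2)]

tv_constant = tv_loss(constant)
tv_ramp = tv_loss(ramp_x)
```

A constant volume has zero total variation, whereas the ramp changes only along x, so one third of its neighbor pairs contribute a difference of one.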
To clean the geometric estimation, we introduce a sparsity prior:
(18) |
Generalization training.
We select for sparse setting and for dense setting. We use the Adam optimizer kingma2014adam with a base learning rate of 1e-3 for the feature network and 5e-4 for the other MLPs. We train with the joint loss for 16 epochs on two A100 GPUs. We increase the value of from 0 to 1 and in the first 2 epochs. In our implementation, we first generate the surface points of each image with the model trained on dense inputs, and then distill the model with sparse inputs, with set to 1. We build the generalized multi-scale volume with 5 scales, whose resolutions increase from to . Each volume carries thin features with only 4 channels, which allows us to save memory compared to general single-volume methods.
Fine-tuning.
After generalization training, we first reconstruct the generalized multi-scale volume, which has encoded the geometry information. Then we sparsify the multi-scale volume by pruning voxels far from the surface. During fine-tuning, we discard the feature network and directly optimize the multi-scale volume and the MLPs. With the generalization prior, we can achieve state-of-the-art performance in only about 20 minutes of fine-tuning.
4 Experiments
We demonstrate the state-of-the-art performance of GenS with comprehensive experiments and verify the effectiveness of each module through ablation studies. We first introduce the datasets and then analyze our results.
Datasets.
We conduct experiments on both the DTU jensen2014large and BlendedMVS yao2020blendedmvs datasets, as in previous methods wang2021neus ; yariv2021volume ; long2022sparseneus . Our generalization model is trained on the DTU dataset, which is an indoor MVS dataset with 124 different scenes scanned from 49 or 64 views with fixed camera trajectories. Following yariv2020multiview ; wang2021neus ; long2022sparseneus , we take the same 15 scenes for testing. The training set is defined as in yao2018mvsnet ; peng2022rethinking , and the test scenes contained therein are removed. We also evaluate our model on BlendedMVS, which is a large-scale synthetic dataset. Each scene is scanned from a different number of views, and all images have a resolution of . We report the Chamfer Distance for DTU, and show some visual results for BlendedMVS.
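For readers unfamiliar with the metric, a simplified Chamfer Distance between two point sets can be sketched as the average of the two directed mean nearest-neighbor distances (often called accuracy and completeness). This brute-force version is only an illustration; the DTU evaluation protocol involves additional steps, such as masking and resampling, that are not reproduced here.

```python
import math
from typing import Sequence

Point = Sequence[float]


def directed_distance(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean distance from each source point to its nearest destination point."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average of accuracy (pred -> gt) and completeness (gt -> pred)."""
    return 0.5 * (directed_distance(pred, gt) + directed_distance(gt, pred))


pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

cd = chamfer_distance(pred_points, gt_points)
```

Lower values are better; a reconstruction identical to the reference point set has a Chamfer Distance of zero.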
Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 | Mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VolRecon* ren2022volrecon | 1.20 | 2.59 | 1.56 | 1.08 | 1.43 | 1.92 | 1.11 | 1.48 | 1.42 | 1.05 | 1.19 | 1.38 | 0.74 | 1.23 | 1.27 | 1.38 |
PixelNerf yu2021pixelnerf | 5.13 | 8.07 | 5.85 | 4.40 | 7.11 | 4.64 | 5.68 | 6.76 | 9.05 | 6.11 | 3.95 | 5.92 | 6.26 | 6.89 | 6.93 | 6.28 |
IBRNet wang2021ibrnet | 2.29 | 3.70 | 2.66 | 1.83 | 3.02 | 2.83 | 1.77 | 2.28 | 2.73 | 1.96 | 1.87 | 2.13 | 1.58 | 2.05 | 2.09 | 2.32 |
MVSNerf chen2021mvsnerf | 1.96 | 3.27 | 2.54 | 1.93 | 2.57 | 2.71 | 1.82 | 1.72 | 2.29 | 1.75 | 1.72 | 1.47 | 1.29 | 2.09 | 2.26 | 2.09 |
SparseNeuS long2022sparseneus | 1.68 | 3.06 | 2.25 | 1.10 | 2.37 | 2.18 | 1.28 | 1.47 | 1.80 | 1.23 | 1.19 | 1.17 | 0.75 | 1.56 | 1.55 | 1.64 |
GenS | 1.45 | 2.77 | 1.69 | 0.97 | 1.54 | 1.90 | 1.03 | 1.49 | 1.36 | 0.97 | 1.07 | 0.97 | 0.62 | 1.14 | 1.16 | 1.34 |
NeuS wang2021neus | 4.57 | 4.49 | 3.97 | 4.32 | 4.63 | 1.95 | 4.68 | 3.83 | 4.15 | 2.50 | 1.52 | 6.47 | 1.26 | 5.57 | 6.11 | 4.00 |
VolSDF yariv2021volume | 4.03 | 4.21 | 6.12 | 0.91 | 8.24 | 1.73 | 2.74 | 1.82 | 5.14 | 3.09 | 2.08 | 4.81 | 0.60 | 3.51 | 2.18 | 3.41 |
IBRNet (ft) | 1.67 | 2.97 | 2.26 | 1.56 | 2.52 | 2.30 | 1.50 | 2.05 | 2.02 | 1.73 | 1.66 | 1.63 | 1.17 | 1.84 | 1.61 | 1.90 |
COLMAP schonberger2016pixelwise | 0.90 | 2.89 | 1.63 | 1.08 | 2.18 | 1.94 | 1.61 | 1.30 | 2.34 | 1.28 | 1.10 | 1.42 | 0.76 | 1.17 | 1.14 | 1.52 |
SparseNeuS (ft) | 1.29 | 2.27 | 1.57 | 0.88 | 1.61 | 1.86 | 1.06 | 1.27 | 1.42 | 1.07 | 0.99 | 0.87 | 0.54 | 1.15 | 1.18 | 1.27 |
GenS (ft) | 0.91 | 2.33 | 1.46 | 0.75 | 1.02 | 1.58 | 0.74 | 1.16 | 1.05 | 0.77 | 0.88 | 0.56 | 0.49 | 0.78 | 0.93 | 1.03 |
4.1 Comparisons
Results on DTU.
We first adopt the same testing split and configuration as long2022sparseneus to compare with existing generalizable methods chen2021mvsnerf ; wang2021ibrnet ; long2022sparseneus ; yu2021pixelnerf . The results in Tab. 1 indicate that our model outperforms existing methods by a significant margin (a mean Chamfer Distance of 1.34 versus 1.64 for SparseNeuS), and this advantage is amplified after rapid fine-tuning (about 20 minutes). Even compared with the recent method ren2022volrecon , which adopts ground-truth depth for supervision, our model achieves superior results. The qualitative results in Fig. 1 show that our reconstruction exhibits finer details. We further conducted experiments on DTU to compare with per-scene overfitting methods using more input views. The quantitative comparisons in Tab. 2 show that our model surpasses some methods yariv2020multiview ; oechsle2021unisurf ; wang2021neus ; yariv2021volume through a very fast network inference alone, e.g., we achieve a 34% improvement on scene 24 compared with wang2021neus (0.66 versus 1.00). After fast fine-tuning, the performance improves significantly, even surpassing recent SOTA works darmon2022improving ; geoneus ; wang2022hf . The visualization results in Fig. 7 show that our model, trained on large amounts of data, is more robust to ambiguous regions.
Generalizing to BlendedMVS.
Like long2022sparseneus , we conduct additional verification on BlendedMVS to showcase the generalization ability of our model. We also employ the same evaluation strategy as long2022sparseneus for a fair comparison. As shown in Fig. 1, through a fast network inference alone, our model recovers more geometric details than SparseNeuS long2022sparseneus . After fine-tuning for 5k iterations, the results improve significantly and are better than those of SparseNeuS with 10k iterations.
Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 | Mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IDR yariv2020multiview | 1.63 | 1.87 | 0.63 | 0.48 | 1.04 | 0.79 | 0.77 | 1.33 | 1.16 | 0.76 | 0.67 | 0.90 | 0.42 | 0.51 | 0.53 | 0.90 |
MVSDF zhang2021learning | 0.83 | 1.76 | 0.88 | 0.44 | 1.11 | 0.90 | 0.75 | 1.26 | 1.02 | 1.35 | 0.87 | 0.84 | 0.34 | 0.47 | 0.46 | 0.88 |
COLMAP schonberger2016pixelwise | 0.45 | 0.91 | 0.37 | 0.37 | 0.90 | 1.00 | 0.54 | 1.22 | 1.08 | 0.64 | 0.48 | 0.59 | 0.32 | 0.45 | 0.43 | 0.65 |
NeRF mildenhall2020nerf | 1.90 | 1.60 | 1.85 | 0.58 | 2.28 | 1.27 | 1.47 | 1.67 | 2.05 | 1.07 | 0.88 | 2.53 | 1.06 | 1.15 | 0.96 | 1.49 |
UNISURF oechsle2021unisurf | 1.32 | 1.36 | 1.72 | 0.44 | 1.35 | 0.79 | 0.80 | 1.49 | 1.37 | 0.89 | 0.59 | 1.47 | 0.46 | 0.59 | 0.62 | 1.02 |
VolSDF yariv2021volume | 1.14 | 1.26 | 0.81 | 0.49 | 1.25 | 0.70 | 0.72 | 1.29 | 1.18 | 0.70 | 0.66 | 1.08 | 0.42 | 0.61 | 0.55 | 0.86 |
NeuS wang2021neus | 1.00 | 1.37 | 0.93 | 0.43 | 1.10 | 0.65 | 0.57 | 1.48 | 1.09 | 0.83 | 0.52 | 1.20 | 0.35 | 0.49 | 0.54 | 0.84 |
HF-NeuS wang2022hf | 0.76 | 1.32 | 0.70 | 0.39 | 1.06 | 0.63 | 0.63 | 1.15 | 1.12 | 0.80 | 0.52 | 1.22 | 0.33 | 0.49 | 0.50 | 0.77 |
Voxurf wu2022voxurf | 0.65 | 0.74 | 0.39 | 0.35 | 0.96 | 0.64 | 0.85 | 1.58 | 1.01 | 0.68 | 0.60 | 1.11 | 0.37 | 0.45 | 0.47 | 0.72 |
NeuralWarp darmon2022improving | 0.49 | 0.71 | 0.38 | 0.38 | 0.79 | 0.81 | 0.82 | 1.20 | 1.06 | 0.68 | 0.66 | 0.74 | 0.41 | 0.63 | 0.51 | 0.68 |
Geo-NeuS* geoneus | 0.46 | 0.83 | 0.38 | 0.39 | 0.88 | 0.61 | 0.51 | 1.26 | 0.92 | 0.68 | 0.57 | 0.82 | 0.30 | 0.41 | 0.42 | 0.63 |
GenS | 0.66 | 1.01 | 0.71 | 0.43 | 1.06 | 0.99 | 0.73 | 1.43 | 1.18 | 0.78 | 0.64 | 0.93 | 0.38 | 0.54 | 0.54 | 0.80 |
GenS (ft) | 0.55 | 0.71 | 0.39 | 0.38 | 0.79 | 0.65 | 0.57 | 1.29 | 0.96 | 0.64 | 0.49 | 0.59 | 0.33 | 0.44 | 0.45 | 0.62 |
4.2 Analysis
Patch-similarity | Multi-scale | Self-enhanced | Mean |
---|---|---|---|
✗ | ✗ | ✗ | 1.86 |
✓ | ✗ | ✗ | 1.76 |
✓ | ✓ | ✗ | 1.73 |
✓ | ✓ | ✓ | 1.62 |
Ablation studies.
We conduct ablation studies on the DTU dataset to understand how the components of our model contribute to the overall performance. We start with our baseline model SparseNeuS long2022sparseneus , and gradually insert our contributions. The results in Tab. 4 show that our full model combining all components has the best mean score, and the baseline model, without any of our contributions, performs the worst.
Multi-scale Feature-metric Consistency (MFC): Our self-enhanced MFC continuously improves the multi-view consistency of the model, and we report the ablation results of its three main characteristics in Tab. 3, where the baseline is based on the pixel-wise feature consistency proposed in zhang2021learning . It can be seen that our strategy is more robust and efficient.
Generalized Multi-scale Volume (GMV): We show some results of models with different resolutions in Fig. 8. The reconstruction from a single high-resolution volume is unbearably noisy (a higher-resolution volume leads to more empty voxels, which are trickier for generalizable models), and the reconstruction from a single low-resolution volume is overly smooth, whereas our GMV reconstructs clean and detailed geometry. Moreover, our representation is lighter than a single volume with a resolution of due to our thin features.
View Contrast Loss (VCL): As the results in Tab. 1 and Tab. 2 show, reconstruction from dense inputs is more accurate than reconstruction from sparse inputs, so we treat the former as a teacher for the latter. The results in Tab. 4 validate that this strategy indeed improves the reconstruction quality of the model. The visualizations of the ablation results in Fig. 9 further show that each of our contributions improves performance.
Limitation.
Although our model exhibits excellent generalization performance in multi-view reconstruction, we found that it cannot satisfactorily handle scenes with large camera motion, such as surrounding-view cases, because in these scenarios the aggregated features are polluted by the features of rays shooting from behind. Our current solution is to first predict the local structures covered by several adjacent viewpoints, as in long2022sparseneus ; yao2018mvsnet ; peng2022rethinking , and then fuse them together.
MFC | GMV | VCL | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 | Mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
✗ | ✗ | ✗ | 2.26 | 3.39 | 2.04 | 1.27 | 2.47 | 2.65 | 1.62 | 1.84 | 1.61 | 1.32 | 1.82 | 1.94 | 0.91 | 1.78 | 1.62 | 1.90 |
✓ | ✗ | ✗ | 1.61 | 3.12 | 1.99 | 1.16 | 2.00 | 2.21 | 1.30 | 1.58 | 1.45 | 1.18 | 1.48 | 1.53 | 0.80 | 1.54 | 1.43 | 1.62 |
✓ | ✓ | ✗ | 1.51 | 3.07 | 1.88 | 0.97 | 1.56 | 2.11 | 1.12 | 1.45 | 1.31 | 0.95 | 1.20 | 1.02 | 0.64 | 1.32 | 1.24 | 1.42 |
✓ | ✓ | ✓ | 1.45 | 2.77 | 1.69 | 0.97 | 1.54 | 1.90 | 1.03 | 1.49 | 1.36 | 0.97 | 1.07 | 0.97 | 0.62 | 1.14 | 1.16 | 1.34 |
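As a quick sanity check on the table above, the short script below recomputes each row's mean from the per-scan values and reports how much each added component lowers the mean error relative to the previous row. The row labels (`baseline`, `+MFC`, and so on) and helper names are ours; the numbers are copied from the table, and because the per-scan values are already rounded, a recomputed mean can differ from the reported one in the last digit.

```python
# Per-scan errors copied from the ablation table (DTU scans 24 ... 122).
# Lower is better. Each entry is (per-scan values, reported mean).
rows = {
    "baseline": ([2.26, 3.39, 2.04, 1.27, 2.47, 2.65, 1.62, 1.84,
                  1.61, 1.32, 1.82, 1.94, 0.91, 1.78, 1.62], 1.90),
    "+MFC": ([1.61, 3.12, 1.99, 1.16, 2.00, 2.21, 1.30, 1.58,
              1.45, 1.18, 1.48, 1.53, 0.80, 1.54, 1.43], 1.62),
    "+MFC+GMV": ([1.51, 3.07, 1.88, 0.97, 1.56, 2.11, 1.12, 1.45,
                  1.31, 0.95, 1.20, 1.02, 0.64, 1.32, 1.24], 1.42),
    "+MFC+GMV+VCL": ([1.45, 2.77, 1.69, 0.97, 1.54, 1.90, 1.03, 1.49,
                      1.36, 0.97, 1.07, 0.97, 0.62, 1.14, 1.16], 1.34),
}

def mean(values):
    return sum(values) / len(values)

def relative_reduction(before, after):
    """Fraction by which the error drops when going from `before` to `after`."""
    return (before - after) / before

recomputed = {name: mean(scans) for name, (scans, _) in rows.items()}
reported = {name: rep for name, (_, rep) in rows.items()}

names = list(rows)
for prev, cur in zip(names, names[1:]):
    drop = 100 * relative_reduction(reported[prev], reported[cur])
    print(f"{prev} -> {cur}: {drop:.1f}% lower mean error")

total = relative_reduction(reported["baseline"], reported["+MFC+GMV+VCL"])
print(f"baseline -> full model: {100 * total:.1f}% lower mean error")
```

On the reported means, the full model lowers the baseline's mean error by a little under 30%, with MFC accounting for the largest single step.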
5 Conclusion
In this paper, we introduced GenS, an end-to-end generalizable neural surface reconstruction model. We first encode all scenes into our generalized multi-scale volume, a more powerful representation that can reconstruct clean and detailed 3D structures. We then introduce the multi-scale feature-metric consistency to address failures of the photometric consistency. The learnable multi-scale feature provides a more discriminative representation and can be self-enhanced during generalization training. Finally, we design a view contrast loss that improves reconstruction accuracy by distilling the finer reconstruction from dense inputs into the reconstruction from sparse inputs. Experimental results on both the DTU and BlendedMVS datasets show that our model possesses stronger generalization ability and achieves state-of-the-art reconstruction through fast network inference or efficient fine-tuning. In the future, we will focus on improving the performance of the model in difficult scenarios.
Acknowledgments and Disclosure of Funding
This work is financially supported by the National Natural Science Foundation of China (U21B2012 and 62072013), the Shenzhen Science and Technology Program-Shenzhen Cultivation of Excellent Scientific and Technological Innovation Talents project (Grant No. RCJC20200714114435057), and the Shenzhen Science and Technology Program-Shenzhen Hong Kong joint funding project (Grant No. SGDX20211123144400001). This work is also financially supported by the Outstanding Talents Training Fund in Shenzhen. In addition, we thank our collaborators in Alibaba Group and the anonymous reviewers for their valuable comments.
References
- [1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM TOG, 28(3):24, 2009.
- [2] Fausto Bernardini, Joshua Mittleman, Holly Rushmeier, Cláudio Silva, and Gabriel Taubin. The ball-pivoting algorithm for surface reconstruction. IEEE TVCG, 5(4):349–359, 1999.
- [3] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, pages 14124–14133, 2021.
- [4] François Darmon, Bénédicte Bascle, Jean-Clément Devaux, Pascal Monasse, and Mathieu Aubry. Improving neural implicit surfaces geometry with patch warping. In CVPR, pages 6260–6269, 2022.
- [5] Qiancheng Fu, Qingshan Xu, Yew-Soon Ong, and Wenbing Tao. Geo-neus: geometry-consistent neural implicit surfaces learning for multi-view reconstruction. NeurIPS, 2022.
- [6] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE TPAMI, 32(8):1362–1376, 2009.
- [7] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In ICCV, pages 873–881, 2015.
- [8] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William T Freeman, and Thomas Funkhouser. Learning shape templates with structured implicit functions. In ICCV, pages 7154–7164, 2019.
- [9] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
- [10] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In CVPR, pages 2495–2504, 2020.
- [11] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In UIST, pages 559–568, 2011.
- [12] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, pages 406–413, 2014.
- [13] Mohammad Mahdi Johari, Yann Lepoittevin, and François Fleuret. Geonerf: Generalizing nerf with geometry priors. In CVPR, pages 18365–18375, 2022.
- [14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711. Springer, 2016.
- [15] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In SGP, volume 7, page 0, 2006.
- [16] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM ToG, 32(3):1–13, 2013.
- [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [18] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. IJCV, 38:199–218, 2000.
- [19] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
- [20] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. NeurIPS, 33:15651–15663, 2020.
- [21] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In CVPR, pages 2019–2028, 2020.
- [22] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without 3d supervision. NeurIPS, 32, 2019.
- [23] Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. In ECCV, pages 210–227. Springer, 2022.
- [24] Nelson Max. Optical models for direct volume rendering. IEEE TVCG, 1(2):99–108, 1995.
- [25] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, pages 4460–4470, 2019.
- [26] Mateusz Michalkiewicz, Jhony K Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. Implicit surface representations as layers in neural networks. In ICCV, pages 4743–4752, 2019.
- [27] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, pages 405–421, 2020.
- [28] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM ToG, 41(4):1–15, 2022.
- [29] Zak Murez, Tarrence Van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In ECCV, pages 414–431. Springer, 2020.
- [30] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Occupancy flow: 4d reconstruction by learning particle dynamics. In ICCV, pages 5379–5389, 2019.
- [31] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In CVPR, pages 3504–3515, 2020.
- [32] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM ToG, 32(6):1–11, 2013.
- [33] Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. Texture fields: Learning texture representations in function space. In ICCV, pages 4531–4540, 2019.
- [34] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In ICCV, pages 5589–5599, 2021.
- [35] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR, pages 165–174, 2019.
- [36] Rui Peng, Rongjie Wang, Zhenyu Wang, Yawen Lai, and Ronggang Wang. Rethinking depth estimation for multi-view stereo: A unified representation. In CVPR, pages 8645–8654, 2022.
- [37] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In ECCV, pages 523–540. Springer, 2020.
- [38] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In CVPR, pages 10318–10327, 2021.
- [39] Ke Qiu, Yawen Lai, Shiyi Liu, and Ronggang Wang. Self-supervised multi-view stereo via inter and intra network pseudo depth. In ACMMM, pages 2305–2313, 2022.
- [40] Yufan Ren, Fangjinghua Wang, Tong Zhang, Marc Pollefeys, and Sabine Süsstrunk. Volrecon: Volume rendering of signed ray distance functions for generalizable multi-view reconstruction. In CVPR, 2023.
- [41] Leonid I Rudin and Stanley Osher. Total variation based image restoration with free local constraints. In ICIP, volume 1, pages 31–35. IEEE, 1994.
- [42] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, pages 2304–2314, 2019.
- [43] Johannes L Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, pages 4104–4113, 2016.
- [44] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In ECCV, pages 501–518. Springer, 2016.
- [45] Steven M Seitz and Charles R Dyer. Photorealistic scene reconstruction by voxel coloring. IJCV, 35:151–173, 1999.
- [46] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR, pages 5459–5469, 2022.
- [47] Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. Neural geometric level of detail: Real-time rendering with implicit 3d shapes. In CVPR, pages 11358–11367, 2021.
- [48] Fangjinghua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. Patchmatchnet: Learned multi-view patchmatch stereo. In CVPR, pages 14194–14203, 2021.
- [49] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021.
- [50] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In CVPR, pages 4690–4699, 2021.
- [51] Yiqun Wang, Ivan Skorokhodov, and Peter Wonka. Hf-neus: Improved surface reconstruction using high-frequency details. NeurIPS, 35:1966–1978, 2022.
- [52] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 13(4):600–612, 2004.
- [53] Tong Wu, Jiaqi Wang, Xingang Pan, Xudong Xu, Christian Theobalt, Ziwei Liu, and Dahua Lin. Voxurf: Voxel-based efficient and accurate neural surface reconstruction. arXiv preprint arXiv:2208.12697, 2022.
- [54] Hongbin Xu, Zhipeng Zhou, Yu Qiao, Wenxiong Kang, and Qiuxia Wu. Self-supervised multi-view stereo via effective co-segmentation and data-augmentation.
- [55] Hongbin Xu, Zhipeng Zhou, Yali Wang, Wenxiong Kang, Baigui Sun, Hao Li, and Yu Qiao. Digging into uncertainty in self-supervised multi-view stereo. In ICCV, pages 6078–6087, 2021.
- [56] Qingshan Xu and Wenbing Tao. Multi-scale geometric consistency guided multi-view stereo. In CVPR, pages 5483–5492, 2019.
- [57] Jiayu Yang, Jose M Alvarez, and Miaomiao Liu. Self-supervised learning of depth inference for multi-view stereo. In CVPR, pages 7526–7534, 2021.
- [58] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In ECCV, pages 767–783, 2018.
- [59] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, pages 1790–1799, 2020.
- [60] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. NeurIPS, 34:4805–4815, 2021.
- [61] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. NeurIPS, 33:2492–2502, 2020.
- [62] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In ICCV, pages 5752–5761, 2021.
- [63] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In CVPR, pages 4578–4587, 2021.
- [64] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. In NeurIPS, volume 35, pages 25018–25032, 2022.
- [65] Jingyang Zhang, Yao Yao, and Long Quan. Learning signed distance field for multi-view surface reconstruction. In ICCV, pages 6525–6534, 2021.
- [66] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.