UniDepth: Universal Monocular Metric Depth Estimation

Luigi Piccinelli1  Yung-Hsu Yang1  Christos Sakaridis1
Mattia Segu1  Siyuan Li1  Luc Van Gool1,2  Fisher Yu1
1ETH Zürich  2INSAIT
Abstract

Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepth, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE methods, UniDepth directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepth implements a self-promptable camera module predicting dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth, even when compared with methods directly trained on the testing domains. Code and models are available at: github.com/lpiccinelli-eth/unidepth.

1 Introduction

Figure 1: We introduce UniDepth, a novel approach that directly predicts 3D points in a scene with only one image as input. UniDepth incorporates a camera self-prompting mechanism and leverages a pseudo-spherical 3D output space defined by azimuth and elevation angles, and depth ($\theta$, $\phi$, $z$). This design effectively separates camera and depth optimization by avoiding gradient flowing to the camera module due to depth-related error ($\varepsilon_{z}$).

Precise pixel-wise depth estimation is crucial to understanding the geometric scene structure, with applications in 3D modeling [10], robotics [63, 11], and autonomous vehicles [51, 38]. However, delivering reliable metric-scaled depth outputs is necessary to perform 3D reconstruction effectively, thus motivating the challenging and inherently ill-posed task of Monocular Metric Depth Estimation (MMDE).

While existing MMDE methods [14, 16, 3, 43, 40, 61, 41] have demonstrated remarkable accuracy across different benchmarks, they require training and testing on datasets with similar camera intrinsics and scene scales. Moreover, the training datasets typically have a limited size and contain little diversity in scenes and cameras. These characteristics result in poor generalization to real-world inference scenarios [52], where images are captured in uncontrolled, arbitrarily structured environments and cameras with arbitrary intrinsics.

Only a few methods [59, 21] have addressed the challenging task of generalizable MMDE. However, these methods assume controlled setups at test time, including camera intrinsics. While this assumption simplifies the task, it has two notable drawbacks. Firstly, it does not address the full application spectrum, e.g. in-the-wild video processing and crowd-sourced image analysis. Secondly, the inherent camera parameter noise is directly injected into the model, leading to large inaccuracies in the high-noise case.

In this work, we address the more demanding task of generalizable MMDE without any reliance on additional external information, such as camera parameters, thus defining the universal MMDE task. Our approach, named UniDepth, is the first that attempts to solve this challenging task without restrictions on scene composition and setup and distinguishes itself through its general and adaptable nature. Unlike existing methods, UniDepth delivers metric 3D predictions for any scene solely from a single image, waiving the need for extra information about scene or camera. Furthermore, UniDepth flexibly allows for the incorporation of additional camera information at test time.

Our design introduces a camera module that outputs a non-parametric, i.e. dense camera representation, serving as the prompt to the depth module. However, relying only on this single additional module clearly results in challenges related to training stability and scale ambiguity. We propose an effective pseudo-spherical representation of the output space to disentangle the camera and depth dimensions of this space. This representation employs azimuth and elevation angle components for the camera and a radial component for the depth, forming a perfect orthogonal space between the camera plane and the depth axis. Moreover, the camera components are embedded through Laplace spherical harmonic encoding. Figure 1 depicts our camera self-prompting mechanism and the output space. Additionally, we introduce a geometric invariance loss to enhance the robustness of depth estimation. The underlying idea is that the camera-conditioned depth features from two views of the same image should exhibit reciprocal consistency. In particular, we sample two geometric augmentations, creating a pair of different views for each training image, thus simulating different apparent cameras for the original scene.

Our overall contribution is the first universal MMDE method, UniDepth, that predicts a point in metric 3D space for each pixel without any input other than a single image. In particular, first, we design a promptable camera module, an architectural component that learns a dense camera representation and allows for non-parametric camera conditioning. Second, we propose a pseudo-spherical representation of the output space, thus solving the intertwined nature of camera and depth prediction. In addition, we introduce a geometric invariance loss to disentangle the camera information from the underlying 3D geometry of the scene. Moreover, we extensively test UniDepth and re-evaluate seven MMDE State-of-the-Art (SotA) methods on ten different datasets in a fair and comparable zero-shot setup to lay the ground for the generalized MMDE task. Owing to its design, UniDepth consistently sets the new state of the art even compared with non-zero-shot methods, ranking first in the competitive official KITTI Depth Prediction Benchmark.

2 Related Work

Figure 2: Model Architecture. UniDepth utilizes solely the input image to generate the 3D output ($\mathbf{O}$). It bootstraps dense camera prediction ($\mathbf{C}$) from the Camera Module, injecting prior knowledge on scene scale into the Depth Module via a cross-attention layer. The camera representation corresponds to azimuth and elevation angles. The geometric invariance loss ($\mathcal{L}_{\mathrm{con}}$) enforces consistency between depth feature tensors conditioned on the camera from different geometric augmentations ($\mathcal{T}_{1}$, $\mathcal{T}_{2}$). Stop-gradient is applied to the encoded feature ($\mathbf{F}$) flowing to the Camera Module to prevent the camera gradient from dominating the depth gradient in the encoder. The depth output ($\mathbf{Z}_{\log}$) is obtained through three self-attention blocks interleaved with learnable $2\times$ upsampling. The final output is the concatenation of the camera and depth tensors ($\mathbf{C} \,\|\, \mathbf{Z}_{\log}$), creating two independent optimization spaces for $\mathcal{L}_{\lambda\mathrm{MSE}}$.

Metric and Scale-agnostic Depth Estimation. It is crucial to distinguish Monocular Metric Depth Estimation (MMDE) from scale-agnostic, namely up-to-a-scale, monocular depth estimation. MMDE SotA approaches typically confine training and testing to the same domain. However, challenges arise, such as overfitting to the training scenario leading to considerable performance drops in the presence of minor domain gaps, often overlooked in benchmarks like NYU-Depthv2 [35] (NYU) and KITTI [18]. On the other hand, scale-agnostic depth methods, including MiDaS [42], OmniData [13], and LeReS [58], show robust generalization by training on extensive datasets. Their limitation lies in the absence of a metric output, hindering practical usage in downstream applications.

Monocular Metric Depth Estimation. The introduction of end-to-end trainable neural networks in MMDE, pioneered by [14], marked a significant milestone, also introducing the optimization process through the Scale-Invariant log loss ($\mathrm{SI}_{\log}$). Subsequent developments witnessed the emergence of advanced networks, ranging from convolution-based architectures [16, 27, 31, 40] to transformer-based approaches [57, 3, 61, 41]. Despite impressive achievements on established benchmarks, MMDE models face challenges in zero-shot scenarios, revealing the need for robust generalization against domain shifts in appearance and geometry.

General Monocular Metric Depth Estimation. Recent efforts focus on developing MMDE models [4, 21, 59] for general depth prediction across diverse domains. These models often leverage camera awareness, either by directly incorporating external camera parameters into computations [15, 21] or by normalizing the shape or output depth based on intrinsic properties, as seen in [28, 1, 59].

However, these generalizable MMDE methods often adopt specific strategies to enhance performance, e.g. geometric pretraining [4] or dataset-specific priors like reshaping [59]. In addition, these methods assume access to noiseless camera intrinsics both at training and test time, also limiting their applicability to pinhole camera models. Moreover, SotA methods depend on a predefined backprojection operation, blurring the distinction between learning depth and the 3D scene. In contrast, our approach aims to overcome these limitations, presenting a more demanding perspective, i.e. universal MMDE. Universal MMDE involves directly predicting the 3D scene from the input image alone. Notably, we do not require any additional prior information at test time, such as access to camera information.

3 UniDepth

MMDE SotA methods typically assume access to the camera intrinsics, thus blurring the line between pure depth estimation and actual 3D estimation. In contrast, UniDepth aims to create a universal MMDE model deployable in diverse scenarios without relying on any other external information, such as camera intrinsics, thus leading to 3D space estimation by design. However, attempting to directly predict 3D points from a single image without a proper internal representation neglects geometric prior knowledge, i.e. perspective geometry, burdening the learning process with re-learning laws of perspective projection from data.

Sec. 3.1 introduces a pseudo-spherical representation of the output space to inherently disentangle camera rays’ angles from depth. In addition, our preliminary studies indicate that depth prediction clearly benefits from prior information on the acquisition sensor, leading to the introduction of a self-prompting camera operation in Sec. 3.2. Further disentanglement at the level of internal depth features is achieved through a geometric invariance loss, outlined in Sec. 3.3. This loss ensures depth features remain invariant when conditioned on the bootstrapped camera predictions, promoting robust camera-aware depth predictions. The overall architecture and the resulting optimization induced by the combination of design choices are detailed in Sec. 3.4.

3.1 3D Representation

The general-purpose nature of our MMDE method requires inferring both depth and camera intrinsics to make 3D predictions based only on imagery observations. We design the 3D output space presenting a natural disentanglement of the two sub-tasks, namely depth estimation and camera calibration. In particular, we exploit the pseudo-spherical representation where the basis is defined by azimuth, elevation, and log-depth, i.e. ($\theta$, $\phi$, $z_{\log}$), in contrast to the Cartesian representation ($x$, $y$, $z$). The strength of the proposed pseudo-spherical representation lies in the decoupling of camera ($\theta$, $\phi$) and depth ($z_{\log}$) components, ensuring their orthogonality by design, in contrast to the entanglement present in Cartesian representation.

It is worth highlighting that in this output space, the non-parametric dense representation of the camera is mathematically represented as a tensor $\mathbf{C} \in \mathbb{R}^{H \times W \times 2}$, where $H$ and $W$ are the height and width of the input image and the last dimension corresponds to azimuth and elevation values. While in the typical Cartesian space, the backprojection involves the multiplication of homogeneous camera rays and depth, the backprojection operation in the proposed representation space amounts to the concatenation of camera and depth representations. The pencil of rays is defined as $(\mathbf{r}_{1}, \mathbf{r}_{2}, \mathbf{r}_{3}) = \mathbf{K}^{-1}[\mathbf{u}, \mathbf{v}, \mathbf{1}]^{T}$, where $\mathbf{K}$ is the calibration matrix, $\mathbf{u}$ and $\mathbf{v}$ are pixel positions in pixel coordinates, and $\mathbf{1}$ is a vector of ones. Therefore, the homogeneous camera rays $(\mathbf{r}_{x}, \mathbf{r}_{y})$ correspond to $\left(\frac{\mathbf{r}_{1}}{\mathbf{r}_{3}}, \frac{\mathbf{r}_{2}}{\mathbf{r}_{3}}\right)$.
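The ray construction above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released implementation: the function names, the pixel-centre offset of 0.5, and the specific angle convention (azimuth as `arctan2(x, z)`, elevation as `arcsin(y)` of the unit ray) are assumptions made here, since the text does not fix them.

```python
import numpy as np


def pixel_rays(K, height, width):
    """Return K^{-1} [u, v, 1]^T for every pixel as an (H, W, 3) array."""
    # Sampling at pixel centres (u + 0.5) is an assumption, not from the text.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    return pixels @ np.linalg.inv(K).T


def rays_to_angles(rays):
    """Map rays to a dense (H, W, 2) tensor of (azimuth, elevation)."""
    unit = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    azimuth = np.arctan2(unit[..., 0], unit[..., 2])  # assumed convention
    elevation = np.arcsin(unit[..., 1])               # assumed convention
    return np.stack([azimuth, elevation], axis=-1)


# Hypothetical pinhole camera for a 640 x 480 image.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
rays = pixel_rays(K, height=480, width=640)
# Homogeneous rays (r_x, r_y) = (r_1 / r_3, r_2 / r_3).
homogeneous = rays[..., :2] / rays[..., 2:3]
# Dense camera representation C with shape (H, W, 2).
C = rays_to_angles(rays)
```

Under these assumptions, `C` depends only on the calibration matrix and the image size, which is what allows the camera and depth components to be handled separately.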

Moreover, the angular dense representation can be embedded via the Laplace Spherical Harmonic Encoding (SHE). The camera embedding tensor is defined as $\mathbf{E} = \mathrm{SHE}(\mathbf{C})$, $\mathbf{E} \in \mathbb{R}^{H \times W \times d}$, where $d$ is the number of harmonics chosen. $\mathrm{SHE}(\cdot)$ computes the set of spherical harmonics, i.e., $\{\mathcal{Y}\}_{l,m}$ with degree $l$ and order $m$, and concatenates them along the channel dimension, with $\mathbf{Y}^{l}_{m}$ defined as

$$Y^{l}_{m}(\theta, \phi) = \alpha^{l}_{m} \mathcal{P}^{l}_{m}(\cos\theta) e^{im\phi}, \tag{1}$$

where $\mathcal{P}^{l}_{m}$ is the associated Legendre polynomial of degree $l$ and order $m$, and $\alpha^{l}_{m}$ is a normalizing constant. In particular, the spherical harmonics on the unit sphere form an orthogonal basis of the spherical manifold and preserve inner products. The total number of harmonics utilized is $81$, resulting from capping the degree $l$ to 8. SHE is utilized as a mathematically sounder choice compared to, e.g., the Fourier Transform, to produce the camera embeddings.
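The count of harmonics can be checked directly. Each degree $l$ admits $2l + 1$ orders, $m \in \{-l, \dots, l\}$, so, assuming all degrees from $0$ up to $8$ are retained, the embedding dimension is:

```latex
\[
d \;=\; \sum_{l=0}^{8} (2l + 1) \;=\; (8 + 1)^{2} \;=\; 81 .
\]
```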

3.2 Self-Promptable Camera

The camera module plays a crucial role in the final 3D predictions since its angular dense output accounts for two dimensions of the output space, namely azimuth and elevation. Most importantly, the embeddings of this output prompt the depth module, providing bootstrapped prior knowledge of the input scene’s global depth scale. The prompting is fundamental to avoid mode collapse in the scene scale and to relieve the depth module of the burden of predicting depth from scratch, as the scale is already modeled by the camera output.

Nonetheless, the internal representation of the camera module is based on a pinhole parameterization, namely via focal length ($f_{x}$, $f_{y}$) and principal point ($c_{x}$, $c_{y}$). The four tokens conceptually corresponding to the intrinsics are then projected to scalar values, i.e., $\Delta f_{x}$, $\Delta f_{y}$, $\Delta c_{x}$, $\Delta c_{y}$. However, they do not directly represent the camera parameters, but the multiplicative residuals to a pinhole camera initialization, namely $\frac{H}{2}$ for y-components and $\frac{W}{2}$ for x-components. This yields $f_{x} = \frac{\Delta f_{x} W}{2}$, $f_{y} = \frac{\Delta f_{y} H}{2}$, $c_{x} = \frac{\Delta c_{x} W}{2}$, and $c_{y} = \frac{\Delta c_{y} H}{2}$, leading to invariance towards input image sizes.
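As a small illustration of this parameterization, the sketch below builds a calibration matrix from the four multiplicative residuals. The function name and the zero-skew matrix layout are assumptions made for illustration; the residual values are arbitrary.

```python
import numpy as np


def intrinsics_from_residuals(delta_fx, delta_fy, delta_cx, delta_cy, height, width):
    """Build a pinhole calibration matrix K from multiplicative residuals."""
    fx = delta_fx * width / 2.0
    fy = delta_fy * height / 2.0
    cx = delta_cx * width / 2.0
    cy = delta_cy * height / 2.0
    # Zero skew is assumed; only (fx, fy, cx, cy) are parameterized.
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])


# The same residuals describe the same camera at two image resolutions.
K_small = intrinsics_from_residuals(1.5, 2.0, 1.0, 1.0, height=240, width=320)
K_large = intrinsics_from_residuals(1.5, 2.0, 1.0, 1.0, height=480, width=640)
```

Doubling the image size doubles the focal lengths and principal point in pixels while the residuals stay fixed, which is the size invariance referred to above.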

Subsequently, a backprojection operation based on the intrinsic parameters is applied to every pixel coordinate to produce the corresponding rays. The rays are normalized and thus represent vectors on a unit sphere. The critical step involves extracting azimuth and elevation from the backprojected rays, effectively creating a “dense” angular camera representation. This dense representation undergoes SHE to produce the embeddings $\mathbf{E}$. The embedded representations are then seamlessly passed to the depth module as a prompt, where they play a vital role as a conditioning factor. The conditioning is enforced via a cross-attention layer between the initialized feature of Depth Module $\mathbf{D} \in \mathbb{R}^{h \times w \times C}$ and the camera embeddings $\mathbf{E}$ where $(h, w) = (H/16, W/16)$. The camera-prompted depth features $\mathbf{D}|\mathbf{E} \in \mathbb{R}^{h \times w \times C}$ are defined as

$$\mathbf{D}|\mathbf{E} = \mathrm{MLP}(\mathrm{CA}(\mathbf{D}, \mathbf{E})), \tag{2}$$

where $\mathrm{CA}$ is a cross-attention block and $\mathrm{MLP}$ is a MultiLayer Perceptron with one $4C$-channel hidden layer.
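To make the data flow of Eq. (2) concrete, the following NumPy sketch applies a single-head cross-attention followed by an MLP with a $4C$-channel hidden layer. Several details are assumptions made only for illustration and are not specified in the text: the single attention head, the residual connection, the ReLU activation, the random weights, and the camera embeddings being already available at the $h \times w$ resolution.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(D, E, Wq, Wk, Wv):
    """Single-head cross-attention: queries from D, keys and values from E."""
    q, k, v = D @ Wq, E @ Wk, E @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return D + softmax(scores) @ v  # residual connection (assumed)


def mlp(x, W1, W2):
    """MLP with one hidden layer of 4C channels; ReLU is an assumed activation."""
    return np.maximum(x @ W1, 0.0) @ W2


rng = np.random.default_rng(0)
h, w, C, d = 4, 6, 8, 81
D = rng.normal(size=(h * w, C))  # flattened initial depth features
E = rng.normal(size=(h * w, d))  # flattened camera embeddings (assumed at h x w)
Wq = rng.normal(size=(C, C)) * 0.1
Wk = rng.normal(size=(d, C)) * 0.1
Wv = rng.normal(size=(d, C)) * 0.1
W1 = rng.normal(size=(C, 4 * C)) * 0.1
W2 = rng.normal(size=(4 * C, C)) * 0.1

# Camera-prompted depth features D|E with shape (h, w, C).
D_given_E = mlp(cross_attention(D, E, Wq, Wk, Wv), W1, W2).reshape(h, w, C)
```

The output keeps the shape of the depth features, so the prompt changes their content without changing the interface to the rest of the Depth Module.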

Figure 3 illustrates one of the main benefits of our camera module. In particular, in high-noise intrinsics or camera-agnostic scenarios, UniDepth can bootstrap the camera prediction, thus displaying total noise insensitivity. However, we can substitute the camera module output to improve 3D reconstruction peak performance if any external dense camera representation is provided. This adaptability enhances the model’s versatility, allowing it to operate seamlessly in diverse setups. Moreover, Figure 3 suggests that training with noisy self-prompts enhances the robustness of UniDepth to noisier external intrinsics if given at test time.

3.3 Geometric Invariance Loss

The spatial locations from the same scene captured by different cameras should correspond when the depth module is conditioned on the specific camera. To this end, we propose a geometric invariance loss to enforce the consistency of camera-prompted depth features of the same scene from different acquisition sensors. In particular, consistency is enforced on features extracted from identical 3D locations.

For each image, we perform $N$ distinct geometrical augmentations, denoted as $\{\mathcal{T}_{i}\}_{i=1}^{N}$, with $N = 2$ in our experiments. This operation involves sampling a rescaling factor $r \sim 2^{\mathcal{U}_{[-1,1]}}$ and a relative translation on the $x$-axis $t \sim \mathcal{U}_{[-0.1,0.1]}$, then cropping the image to the network’s input shape. This is analogous to sampling a pair of images from the same scene and extrinsic parameters but captured by different cameras. Let $\mathbf{C}_{i}$ and $\mathbf{D}_{i}|\mathbf{E}_{i}$ describe the predicted camera representation and camera-prompted depth features, respectively, corresponding to augmentation $\mathcal{T}_{i}$. It is evident that the camera representations differ when two diverse geometric augmentations are applied, i.e., $\mathbf{C}_{i} \neq \mathbf{C}_{j}$ if $\mathcal{T}_{i} \neq \mathcal{T}_{j}$. Therefore, the geometric invariance loss can be expressed as

$$\mathcal{L}_{\mathrm{con}}(\mathbf{D}_{1}|\mathbf{E}_{1}, \mathbf{D}_{2}|\mathbf{E}_{2}) = \left\| \mathcal{T}_{2} \circ \mathcal{T}^{-1}_{1} \circ (\mathbf{D}_{1}|\mathbf{E}_{1}) - \mathrm{sg}(\mathbf{D}_{2}|\mathbf{E}_{2}) \right\|_{1}, \tag{3}$$

where $\mathbf{D}_{i}|\mathbf{E}_{i}$ represents the depth feature after being conditioned by camera prompt $\mathbf{E}_{i}$, as outlined in Sec. 3.2, and $\mathrm{sg}(\cdot)$ corresponds to the stop-gradient detach operation needed to exploit $\mathbf{D}_{2}|\mathbf{E}_{2}$ as pseudo ground-truth (GT). The bi-directional loss can be computed as: $\frac{1}{2}\left(\mathcal{L}_{\mathrm{con}}(\mathbf{D}_{1}|\mathbf{E}_{1}, \mathbf{D}_{2}|\mathbf{E}_{2}) + \mathcal{L}_{\mathrm{con}}(\mathbf{D}_{2}|\mathbf{E}_{2}, \mathbf{D}_{1}|\mathbf{E}_{1})\right)$. It is necessary to apply the geometric invariance loss after the features are conditioned on the viewing information, i.e., camera. Otherwise, the loss would enforce consistency across features that inherently carry distinct camera information.
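The sampling of augmentation parameters and the bi-directional L1 consistency can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the feature maps are assumed to be already warped into a common frame (the warp $\mathcal{T}_{2} \circ \mathcal{T}^{-1}_{1}$ itself is not implemented), and the stop-gradient is only indicated in a comment because NumPy does not track gradients.

```python
import numpy as np


def sample_augmentation(rng):
    """Sample a rescaling factor r ~ 2^U[-1, 1] and an x-translation t ~ U[-0.1, 0.1]."""
    r = 2.0 ** rng.uniform(-1.0, 1.0)
    t = rng.uniform(-0.1, 0.1)
    return r, t


def consistency_loss(source_aligned, target):
    """L1 norm between aligned source features and the pseudo ground-truth target.

    In an autodiff framework, `target` would be wrapped in a stop-gradient.
    """
    return np.abs(source_aligned - target).sum()


def bidirectional_loss(d1_in_view2, d2, d2_in_view1, d1):
    """Average of the two directed consistency losses."""
    return 0.5 * (consistency_loss(d1_in_view2, d2) + consistency_loss(d2_in_view1, d1))


rng = np.random.default_rng(0)
r, t = sample_augmentation(rng)

# Identical, perfectly aligned features give zero loss.
features = rng.normal(size=(4, 6, 8))
loss_same = bidirectional_loss(features, features, features, features)
```

In practice the aligned inputs would come from resampling one feature map into the frame of the other, restricted to the region where the two crops overlap.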

Figure 3: Impact of noise in camera intrinsics. The amount of relative distortion ($\varepsilon_{\mathrm{CAM}(\%)}$) of the intrinsics is shown on the x-axis, while $\delta_{0.5}$ performance on OOD test sets is shown on the y-axis. Relying on external input inherently leads to being subject to its noise. UniDepth functions in dual regimes, with and without external intrinsics. In situations of unknown intrinsics or high noise, UniDepth exhibits total robustness by bootstrapping camera prediction (Ours). In contrast, with low-noise intrinsics, we leverage them for enhanced peak performance (Ours-CAM).

3.4 Network Design

Architecture. Our network, described in Fig. 2, comprises an Encoder Backbone, a Camera Module, and a Depth Module. The encoder can be either convolutional or ViT-based [12], producing features at different “scales”, i.e. $\mathbf{F} \in \mathbb{R}^{h \times w \times C \times B}$, where $(h, w) = (\frac{H}{16}, \frac{W}{16})$ and $B = 4$.

The Camera Module parameters are initialized as class tokens for ViT-style backbones or as pooled feature maps for convolutional-style backbones. The encoded features from the Encoder Backbone are passed to the Camera Module as a stack of detached tokens, while the encoder class tokens are utilized to initialize the camera parameters. The features are processed to obtain the final dense representation $\mathbf{C}$ as detailed in Sec. 3.2, and further embedded to $\mathbf{E}$ via $\mathrm{SHE}(\cdot)$ outlined in Sec. 3.1. Note that the stop-gradient operation is necessary because of the low variety of effective cameras compared to the image diversity. Without it, the Camera Module component easily overfits and clearly dominates the overall backbone gradient.

The Depth Module is fed with the encoder features to condition the initial latent features $\mathbf{L} \in \mathbb{R}^{h \times w \times C}$ via one cross-attention layer to obtain the initial depth features, $\mathbf{D}$. The latent feature tensor $\mathbf{L}$ is obtained as the average of the features $\mathbf{F}$ along the $B$ dimension. Furthermore, the depth features are conditioned on the camera prompts $\mathbf{E}$ to obtain $\mathbf{D}|\mathbf{E}$ as described in Sec. 3.2. The camera-prompted depth features are further processed via self-attention layers, where the positional encoding utilized is $\mathbf{E}$, and upsampled to produce a multi-scale output. The log-depth prediction $\mathbf{Z}_{\log} \in \mathbb{R}^{H \times W \times 1}$ corresponds to the mean of the interpolated intermediate representations. The final 3D output $\mathbf{O} \in \mathbb{R}^{H \times W \times 3}$ is the concatenation of predicted rays and depth, $\mathbf{O} = \mathbf{C} \,\|\, \mathbf{Z}$, with $\mathbf{Z}$ as element-wise exponentiation of $\mathbf{Z}_{\log}$.
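The final assembly step is simple enough to write out. The sketch below, with a hypothetical function name and placeholder inputs, exponentiates the log-depth and concatenates it with the dense camera representation along the channel dimension.

```python
import numpy as np


def assemble_output(C, Z_log):
    """Concatenate angles (H, W, 2) with metric depth (H, W, 1) into (H, W, 3)."""
    Z = np.exp(Z_log)  # element-wise exponentiation of the log-depth
    return np.concatenate([C, Z], axis=-1)


H, W = 4, 6
C = np.zeros((H, W, 2))                   # placeholder azimuth and elevation
Z_log = np.full((H, W, 1), np.log(2.5))   # placeholder log-depth (depth 2.5)
O = assemble_output(C, Z_log)
```

No multiplication between rays and depth is involved, in contrast to Cartesian backprojection.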

Optimization. The optimization process is guided by a re-formulation of the Mean Squared Error (MSE) loss in the final 3D output space ($\theta$, $\phi$, $z_{\log}$) from Sec. 3.1 as:

$$\mathcal{L}_{\lambda\mathrm{MSE}}(\bm{\varepsilon}) = \|\mathbb{V}[\bm{\varepsilon}]\|_{1} + \bm{\lambda}^{T}(\mathbb{E}[\bm{\varepsilon}] \odot \mathbb{E}[\bm{\varepsilon}]), \tag{4}$$

where $\bm{\varepsilon}=\hat{\mathbf{o}}-\mathbf{o}^{*}\in\mathbb{R}^{3}$, $\hat{\mathbf{o}}=(\hat{\theta},\hat{\phi},\hat{z}_{\log})$ is the predicted 3D output, $\mathbf{o}^{*}=(\theta^{*},\phi^{*},z_{\log}^{*})$ is the GT 3D value, and $\bm{\lambda}=(\lambda_{\theta},\lambda_{\phi},\lambda_{z})\in\mathbb{R}^{3}$ is a vector of weights for each dimension of the output. $\mathbb{V}[\bm{\varepsilon}]$ and $\mathbb{E}[\bm{\varepsilon}]$ are computed as the vectors of empirical variances and means for each of the three output dimensions over all pixels, i.e. $\{\bm{\varepsilon}^{(i)}\}_{i=1}^{N}$.
Note that if $\lambda_{d}=1$ for a dimension $d$, the loss represents the standard MSE loss for that dimension, since $\mathbb{V}[\varepsilon_{d}]+\mathbb{E}[\varepsilon_{d}]^{2}=\mathbb{E}[\varepsilon_{d}^{2}]$. If $\lambda_{d}<1$, a scale-invariant loss term is added to that dimension if it is expressed in log space, e.g. for the depth dimension $z_{\log}$, and a shift-invariant loss term is added if that output is expressed in linear space. In particular, if only the last output dimension is considered, i.e., the one corresponding to depth, and $\lambda_{z}=0.15$ is utilized, the corresponding loss is the standard $\mathrm{SI}_{\log}$. In our experiments, we set $\lambda_{\theta}=\lambda_{\phi}=1$ and $\lambda_{z}=0.15$. Therefore, the final optimization loss is defined as

$$\mathcal{L}=\mathcal{L}_{\lambda\mathrm{MSE}}+\alpha\mathcal{L}_{\mathrm{con}},\text{ with }\alpha=0.1. \tag{5}$$
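The $\lambda\mathrm{MSE}$ term of Eq. 4 and its combination in Eq. 5 can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions rather than the released training code: the names `lambda_mse` and `total_loss` are hypothetical, the inputs are flat lists of per-pixel ($\theta$, $\phi$, $z_{\log}$) triples, and the empirical variance is taken as the population (biased) variance so that $\lambda_{d}=1$ recovers the MSE exactly.

```python
def lambda_mse(pred, target, lambdas=(1.0, 1.0, 0.15)):
    """Eq. 4 for N pixels; pred and target are lists of (theta, phi, z_log)."""
    n = len(pred)
    loss = 0.0
    for d, lam in enumerate(lambdas):
        errors = [p[d] - t[d] for p, t in zip(pred, target)]
        mean = sum(errors) / n
        # Population (biased) variance, so that var + mean**2 equals the MSE.
        var = sum((e - mean) ** 2 for e in errors) / n
        # L1 norm of the variance vector plus the lambda-weighted squared mean.
        loss += abs(var) + lam * mean * mean
    return loss


def total_loss(l_lambda_mse, l_con, alpha=0.1):
    """Eq. 5: the consistency loss is added with weight alpha."""
    return l_lambda_mse + alpha * l_con
```

With $\lambda_{z}<1$, a constant offset in $z_{\log}$ (a global scale factor in depth) is penalized only through the down-weighted mean term, which is what makes the depth dimension partially scale-invariant.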

The loss defined here serves as a motivation for the designed output representation. Specifically, employing a Cartesian representation and applying the loss directly to the output space would result in backpropagation through ($x$, $y$), and $z_{\log}$ errors. However, $x$ and $y$ components are derived as $r_{x}\cdot z$ and $r_{y}\cdot z$ as detailed in Sec. 3.1. Consequently, the gradients of camera components, expressed by ($r_{x}$, $r_{y}$), and of depth become intertwined, leading to suboptimal optimization as discussed in Sec. 4.3.
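To make the entanglement concrete, consider a worked sketch for the $x$ component alone. It assumes a plain squared error $\varepsilon_{x}^{2}$ on the Cartesian coordinate (an illustrative choice, not necessarily the exact loss used in the ablation), with $\hat{r}_{x}$, $\hat{z}$ denoting predictions and $r_{x}^{*}$, $z^{*}$ the ground truth:

$$
\begin{aligned}
\varepsilon_{x} &= \hat{r}_{x}\hat{z}-r_{x}^{*}z^{*},\\
\frac{\partial\varepsilon_{x}^{2}}{\partial\hat{r}_{x}} &= 2\,\varepsilon_{x}\,\hat{z},\qquad
\frac{\partial\varepsilon_{x}^{2}}{\partial\hat{z}} = 2\,\varepsilon_{x}\,\hat{r}_{x},\\
\hat{r}_{x}=r_{x}^{*}\;\Rightarrow\;\frac{\partial\varepsilon_{x}^{2}}{\partial\hat{r}_{x}} &= 2\,r_{x}^{*}\,\hat{z}\,(\hat{z}-z^{*}).
\end{aligned}
$$

Even a perfect ray prediction therefore receives a nonzero gradient whenever $\hat{z}\neq z^{*}$ (and $r_{x}^{*}\neq 0$), i.e. the camera component is updated because of a depth error. In the pseudo-spherical space, by contrast, $\varepsilon_{\theta}=\hat{\theta}-\theta^{*}$ gives $\partial\varepsilon_{\theta}^{2}/\partial\hat{\theta}=2\,\varepsilon_{\theta}$, which does not involve $\hat{z}$.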

4 Experiments

Table 1: Comparison on zero-shot evaluation. All methods are tested in a zero-shot setting on eight different datasets without overlap with any of the sets used for training. UniDepth-{C, V}: UniDepth-{ConvNext [33], ViT [12]}. (†): DDAD [20] in training set. (‡): predicted intrinsics are utilized for conditioning and backprojecting. Best viewed on a screen and zoomed in.
| Method | NuScenes $\delta_1\uparrow$ | NuScenes $\mathrm{SI}_{\log}\downarrow$ | NuScenes $\mathrm{F}_A\uparrow$ | DDAD $\delta_1\uparrow$ | DDAD $\mathrm{SI}_{\log}\downarrow$ | DDAD $\mathrm{F}_A\uparrow$ | ETH3D $\delta_1\uparrow$ | ETH3D $\mathrm{SI}_{\log}\downarrow$ | ETH3D $\mathrm{F}_A\uparrow$ | Diode (Indoor) $\delta_1\uparrow$ | Diode (Indoor) $\mathrm{SI}_{\log}\downarrow$ | Diode (Indoor) $\mathrm{F}_A\uparrow$ | SUN-RGBD $\delta_1\uparrow$ | SUN-RGBD $\mathrm{SI}_{\log}\downarrow$ | SUN-RGBD $\mathrm{F}_A\uparrow$ | VOID $\delta_1\uparrow$ | VOID $\mathrm{SI}_{\log}\downarrow$ | VOID $\mathrm{F}_A\uparrow$ | IBims-1 $\delta_1\uparrow$ | IBims-1 $\mathrm{SI}_{\log}\downarrow$ | IBims-1 $\mathrm{F}_A\uparrow$ | HAMMER $\delta_1\uparrow$ | HAMMER $\mathrm{SI}_{\log}\downarrow$ | HAMMER $\mathrm{F}_A\uparrow$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BTS [28] | 33.7 | 68.0 | 37.5 | 43.0 | 40.8 | 40.5 | 26.8 | 29.9 | 27.4 | 19.2 | 22.8 | 31.6 | 76.1 | 14.6 | 64.8 | 47.4 | 25.8 | 64.5 | 53.1 | 17.5 | 57.2 | 3.89 | 20.9 | 22.8 |
| AdaBins [3] | 33.3 | 61.4 | 35.2 | 37.7 | 44.4 | 35.6 | 24.3 | 28.3 | 25.2 | 17.4 | 21.6 | 28.7 | 77.7 | 13.9 | 65.4 | 50.5 | 23.8 | 65.0 | 55.0 | 15.6 | 57.8 | 7.21 | 21.5 | 27.7 |
| NeWCRF [61] | 44.2 | 49.4 | 42.2 | 45.6 | 34.9 | 41.6 | 35.7 | 26.1 | 32.3 | 20.1 | 18.5 | 35.3 | 75.3 | 11.9 | 61.6 | 53.1 | 22.3 | 67.9 | 53.6 | 14.7 | 59.2 | 1.43 | 14.9 | 20.8 |
| iDisc [41] | 39.4 | 37.1 | 34.5 | 28.4 | 32.2 | 25.8 | 35.6 | 27.5 | 31.4 | 23.8 | 15.8 | 33.4 | 83.7 | 12.4 | 71.0 | 55.3 | 20.3 | 68.6 | 48.9 | 13.2 | 55.4 | 2.58 | 14.0 | 32.6 |
| ZoeDepth [4] | 28.3 | 31.5 | 26.0 | 27.2 | 31.7 | 21.1 | 35.0 | 17.6 | 26.4 | 36.9 | 12.8 | 40.5 | 86.7 | 9.58 | 75.6 | 63.4 | 15.9 | 72.4 | 58.0 | 10.9 | 59.6 | 0.72 | 9.78 | 21.0 |
| Metric3D† [59] | 72.3 | 29.0 | 53.9 | - | - | - | $\underline{45.6}$ | 18.9 | $\mathbf{35.9}$ | 39.2 | 11.1 | 42.1 | 15.4 | 13.4 | 14.4 | 65.9 | 16.2 | 70.4 | $\mathbf{79.7}$ | 10.1 | $\mathbf{68.5}$ | 3.40 | 12.1 | 29.0 |
| UniDepth-C | $\underline{83.3}$ | $\underline{22.9}$ | 62.3 | $\underline{83.2}$ | $\underline{21.4}$ | 59.3 | $\mathbf{49.8}$ | 13.2 | $\underline{33.7}$ | 60.2 | 9.03 | 50.0 | $\underline{94.8}$ | 8.10 | $\underline{81.4}$ | 86.6 | $\underline{12.8}$ | 85.1 | $\mathbf{79.7}$ | 8.92 | $\underline{66.7}$ | $\mathbf{20.2}$ | $\underline{8.78}$ | $\mathbf{57.1}$ |
| UniDepth-V | $\mathbf{86.2}$ | $\mathbf{21.7}$ | $\mathbf{64.2}$ | $\mathbf{86.4}$ | $\mathbf{20.3}$ | $\mathbf{61.8}$ | 32.6 | $\underline{11.6}$ | 24.3 | $\underline{77.1}$ | $\underline{6.38}$ | $\mathbf{59.4}$ | $\mathbf{96.6}$ | $\mathbf{7.05}$ | $\mathbf{81.9}$ | $\underline{89.4}$ | $\mathbf{10.9}$ | $\underline{85.7}$ | 23.9 | $\underline{7.22}$ | 37.1 | $\underline{13.3}$ | $\mathbf{7.41}$ | $\underline{55.9}$ |
| UniDepth-C‡ | $\underline{83.3}$ | $\underline{22.9}$ | 60.9 | 83.1 | $\underline{21.4}$ | 57.3 | 22.9 | 13.1 | 25.4 | 60.4 | 9.01 | 49.9 | 92.3 | 8.27 | 75.2 | 86.5 | $\underline{12.8}$ | 85.0 | $\underline{79.4}$ | 8.88 | 64.2 | 12.7 | 9.30 | 54.8 |
| UniDepth-V‡ | $\mathbf{86.2}$ | $\mathbf{21.7}$ | $\underline{63.0}$ | $\mathbf{86.4}$ | $\mathbf{20.3}$ | $\underline{60.4}$ | 17.6 | $\mathbf{11.4}$ | 21.4 | $\mathbf{77.4}$ | $\mathbf{6.36}$ | $\underline{58.6}$ | $\underline{94.8}$ | $\underline{7.17}$ | 75.9 | $\mathbf{90.2}$ | $\mathbf{10.9}$ | $\mathbf{86.2}$ | 17.5 | $\mathbf{7.20}$ | 36.5 | 2.56 | 8.35 | 53.8 |

4.1 Experimental Setup

In-domain training datasets. The training dataset utilized is the ensemble of Argoverse2 [53], Waymo [49], DrivingStereo [56], Cityscapes [7], BDD100K [60], Mapillary-PSD [1], A2D2 [19], ScanNet [8], and Taskonomy [62]. The resulting dataset amounts to roughly 3M real-world images with different cameras and domains, compared to, e.g., Metric3D [59] and ZeroDepth [21], which exploit 8M and 17M training images, respectively.

Zero-shot testing datasets. We evaluate the generalizability of the compared models by testing them on ten datasets not seen during training. More precisely, each method is tested on validation splits from SUN-RGBD [48] without the NYU split, Diode Indoor [50], IBims-1 [26], VOID [54], HAMMER [25], ETH3D [44], nuScenes [5], and DDAD [20], using the split proposed in [41] and evaluating with the official masks. In addition, UniDepth and the models from [59, 21] are zero-shot-tested on NYU-Depth V2 [35] and KITTI [18]. In particular, KITTI testing is performed on the corrected Eigen-split test set [14] with the Garg evaluation mask [17], while NYU testing uses the evaluation mask from [28].

Evaluation Details. All methods have been re-evaluated with a fair and consistent pipeline. In particular, we do not exploit any test-time augmentations. We use training image shapes for zero-shot testing and evaluate on the same validation splits and masks. Unfortunately, ZeroDepth lacks full code reproducibility, thus we report results from the original paper only, and for visualization, we utilize their provided code and weights. When methods do not report the configuration for a specific test dataset, we use the settings of NYU and KITTI for indoor and outdoor testing, respectively. We utilize common depth estimation evaluation metrics: root mean square error ($\mathrm{RMS}$) and its log variant ($\mathrm{RMS_{log}}$), absolute mean relative error ($\mathrm{A.Rel}$), the percentage of inlier pixels ($\delta_{i}$) with threshold $1.25^{i}$, and scale-invariant error in log-scale ($\mathrm{SI_{log}}$): $100\sqrt{\mathrm{Var}(\varepsilon_{\log})}$. In addition, we report point-cloud-based metrics proposed in [37], namely Chamfer Distance ($\mathrm{CD}$) and F-score ($\mathrm{F_{A}}$), with the latter aggregated as the area under the curve up to $1/20$ of the datasets' maximum depth. All methods exploit GT intrinsics during evaluation. Nonetheless, we present results both with and without GT intrinsics for UniDepth.
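The per-pixel depth metrics listed above can be sketched as follows. This is a minimal plain-Python illustration, not the evaluation pipeline used for the tables: the function name `depth_metrics`, the dictionary keys, and the use of flat lists of already-masked positive depths are assumptions, and $\delta_{i}$ is implemented with the common convention $\max(\hat{z}/z^{*}, z^{*}/\hat{z})<1.25^{i}$. The point-cloud metrics ($\mathrm{CD}$, $\mathrm{F_{A}}$) are not covered.

```python
import math


def depth_metrics(pred, gt, i=1):
    """pred, gt: equal-length lists of positive depths for the valid pixels."""
    n = len(gt)
    # delta_i: percentage of pixels whose ratio max(pred/gt, gt/pred) < 1.25**i.
    inliers = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < 1.25 ** i)
    # Absolute mean relative error, returned as a fraction.
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    rms = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # Per-pixel error in log space, shared by RMS_log and SI_log.
    log_err = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    rms_log = math.sqrt(sum(e * e for e in log_err) / n)
    mean_log = sum(log_err) / n
    var_log = sum((e - mean_log) ** 2 for e in log_err) / n
    return {
        "delta": 100.0 * inliers / n,
        "a_rel": abs_rel,
        "rms": rms,
        "rms_log": rms_log,
        # SI_log = 100 * sqrt(Var(eps_log)): unaffected by a global scale factor.
        "si_log": 100.0 * math.sqrt(var_log),
    }
```

For example, a prediction that is exactly twice the ground truth everywhere has an $\mathrm{SI_{log}}$ of (numerically) zero but a $\delta_{1}$ of zero as well, which is why the two metrics are read as scale-invariant and scale-dependent, respectively.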

Implementation Details. UniDepth is implemented in PyTorch [39] and CUDA [36]. For training, we use the AdamW [34] optimizer ($\beta_{1}=0.9$, $\beta_{2}=0.999$) with an initial learning rate of $0.0001$. The learning rate is divided by a factor of 10 for the backbone weights in every experiment, and weight decay is set to $0.1$. As the learning rate scheduler, we exploit Cosine Annealing to one-tenth starting from 30% of the training. We run 1M optimization iterations with a batch size of 128, where each training dataset is uniformly represented in each batch. In particular, we sample 64 images and then sample two different augmented views of each image for the consistency loss. The augmentations include both geometric and appearance (random brightness, gamma, saturation, hue shift, and grayscale) augmentations. The ViT-L [12] backbone is initialized with weights from DINO-pre-trained [6] models, and ConvNext-L [33] is ImageNet [9]-pre-trained. The required training time amounts to roughly 12 days on 8 NVIDIA A100 GPUs. Ablations are conducted with three different seeds and for 100k training iterations, using a randomly sampled subset with a size equal to 20% of the original training set.
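The learning-rate schedule and batch layout described above can be sketched as below. This is one possible reading, not the released training script: it assumes the rate stays constant for the first 30% of the iterations and then follows a cosine curve down to one-tenth of its initial value at the last iteration. All names (`learning_rates`, `batch_size`, and the constants) are hypothetical.

```python
import math

BASE_LR = 1e-4                 # initial learning rate
TOTAL_STEPS = 1_000_000        # 1M optimization iterations
DECAY_START_STEP = round(0.3 * TOTAL_STEPS)  # assumed: annealing begins at 30%
FINAL_RATIO = 0.1              # anneal to one-tenth of the initial rate
BACKBONE_LR_DIVISOR = 10.0     # backbone weights use lr / 10


def learning_rates(step):
    """Return (head_lr, backbone_lr) at a given optimization step."""
    if step <= DECAY_START_STEP:
        scale = 1.0
    else:
        span = TOTAL_STEPS - DECAY_START_STEP
        progress = min(1.0, (step - DECAY_START_STEP) / span)
        # Cosine factor goes from 1 (start of annealing) to 0 (end of training).
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        scale = FINAL_RATIO + (1.0 - FINAL_RATIO) * cosine
    lr = BASE_LR * scale
    return lr, lr / BACKBONE_LR_DIVISOR


def batch_size(num_images=64, views_per_image=2):
    """64 sampled images, each with two augmented views, fill a batch of 128."""
    return num_images * views_per_image
```

Under this reading, the head learning rate is $10^{-4}$ up to iteration 300k and reaches $10^{-5}$ at iteration 1M, with the backbone rate a factor of 10 lower throughout.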

Table 2: Comparison on NYU test set. The first five methods are trained on NYU and tested on it. The last four methods are tested in a zero-shot setting. UniDepth-{C, V}: UniDepth-{ConvNext [33], ViT [12]}. (†): MiDaS [42] pre-trained.
| Method | $\delta_{0.5}\uparrow$ | $\delta_1\uparrow$ | $\mathrm{F}_A\uparrow$ | $\mathrm{A.Rel}\downarrow$ | $\mathrm{RMS}\downarrow$ | $\mathrm{RMS}_{\log}\downarrow$ | $\mathrm{CD}\downarrow$ | $\mathrm{SI}_{\log}\downarrow$ |
|---|---|---|---|---|---|---|---|---|
| BTS [28] | 66.1 | 88.5 | 74.0 | 10.9 | 0.391 | 0.141 | 0.160 | 11.5 |
| AdaBins [3] | 68.1 | 90.1 | 74.7 | 10.3 | 0.365 | 0.131 | 0.156 | 10.6 |
| NeWCRF [61] | 69.6 | 92.1 | 75.8 | 9.56 | 0.333 | 0.119 | 0.147 | 9.16 |
| iDisc [41] | 74.5 | 93.8 | 78.2 | 8.61 | 0.313 | 0.110 | 0.133 | 8.85 |
| ZoeDepth† [4] | 78.4 | 95.2 | 80.1 | 7.70 | 0.278 | 0.097 | 0.125 | 7.19 |
| ZeroDepth [21] | - | 90.1 | - | 10.0 | 0.380 | - | - | - |
| Metric3D [59] | 76.3 | 92.6 | 77.8 | 9.38 | 0.337 | 0.120 | 0.146 | 9.13 |
| UniDepth-C | $\underline{85.4}$ | $\underline{97.2}$ | $\underline{84.3}$ | $\underline{6.26}$ | $\underline{0.232}$ | $\underline{0.082}$ | $\underline{0.101}$ | $\underline{6.41}$ |
| UniDepth-V | $\mathbf{88.6}$ | $\mathbf{98.4}$ | $\mathbf{85.9}$ | $\mathbf{5.78}$ | $\mathbf{0.201}$ | $\mathbf{0.073}$ | $\mathbf{0.092}$ | $\mathbf{5.27}$ |

4.2 Comparison with the State of the Art

Our method consistently outperforms previous SotA methods as shown in Table 1. We particularly excel in the scale-invariant aspect, represented by $\mathrm{SI}_{\log}$, with an average 34.0% improvement, and an average 12.3% improvement for $\delta_{1}$ and $\mathrm{F}_{A}$. However, UniDepth could fail to capture the specific scene scales in certain cases, e.g. in ETH3D and IBims-1. This pitfall is demonstrated by the drop in scale-dependent metrics, e.g. the $\mathrm{F}_{A}$ drop is 11.8% and 31.4%, respectively, despite a clear scale-invariant improvement of 36.9% and 28.5%. Therefore, we speculate that our method would still greatly benefit from domain-specific fine-tuning.

Table 3: Comparison on KITTI Eigen-split test set. The first five methods are trained on KITTI and tested on it. The last four methods are tested in a zero-shot setting. UniDepth-{C, V}: UniDepth-{ConvNext [33], ViT [12]}. (†): MiDaS [42] pre-trained.
| Method | $\delta_{0.5}\uparrow$ | $\delta_1\uparrow$ | $\mathrm{F}_A\uparrow$ | $\mathrm{A.Rel}\downarrow$ | $\mathrm{RMS}\downarrow$ | $\mathrm{RMS}_{\log}\downarrow$ | $\mathrm{CD}\downarrow$ | $\mathrm{SI}_{\log}\downarrow$ |
|---|---|---|---|---|---|---|---|---|
| BTS [28] | 86.9 | 96.2 | 82.0 | 5.63 | 2.43 | 0.089 | 0.42 | 8.18 |
| AdaBins [3] | 86.2 | 96.3 | 81.5 | 5.85 | 2.38 | 0.089 | 0.429 | 8.10 |
| NeWCRF [61] | 88.9 | 97.5 | 82.7 | 5.20 | 2.07 | 0.078 | 0.388 | 7.00 |
| iDisc [41] | 89.2 | 97.5 | 83.1 | 5.09 | 2.07 | 0.077 | 0.380 | 7.11 |
| ZoeDepth† [4] | 87.4 | 96.5 | 82.1 | 5.76 | 2.39 | 0.089 | 0.431 | 7.47 |
| ZeroDepth [21] | - | 89.2 | - | 10.2 | 4.38 | 0.196 | - | - |
| Metric3D [59] | 88.9 | 97.5 | 82.9 | 5.33 | 2.26 | 0.081 | 0.392 | 7.28 |
| UniDepth-C | $\underline{91.1}$ | $\underline{97.9}$ | $\underline{83.9}$ | $\underline{4.69}$ | $\underline{2.00}$ | $\underline{0.072}$ | $\underline{0.371}$ | $\underline{6.71}$ |
| UniDepth-V | $\mathbf{93.4}$ | $\mathbf{98.6}$ | $\mathbf{85.0}$ | $\mathbf{4.21}$ | $\mathbf{1.75}$ | $\mathbf{0.064}$ | $\mathbf{0.338}$ | $\mathbf{5.84}$ |
[Figure 4 image grid. Rows: KITTI, NYUv2, Diode. Columns: RGB & GT, ZoeDepth [4], ZeroDepth [21], Metric3D [59], UniDepth, colormap ranges (Meters | $\mathrm{A.Rel}$).]
Figure 4: Zero-shot qualitative results. Each pair of consecutive rows corresponds to one test sample. Each odd row shows the input RGB image and the predicted pointcloud color-coded with coolwarm based on the absolute relative error. Each even row shows GT depth and the predicted depth. The last column represents the specific colormap ranges for depth and error. (†): KITTI and NYU in the training set.

The last two rows in Table 1 present UniDepth in its whole design, namely functioning with solely the input image by self-prompting the predicted dense camera representation, as detailed in Eq. 3. Experiments show that not only is the performance preserved for most of the test sets, but UniDepth with the bootstrapped camera can also outperform models with GT camera, e.g. $\mathrm{SI}_{\log}$ in ETH3D and IBims-1. On the other hand, in cases with particularly out-of-domain camera types, such as ETH3D or HAMMER, bootstrapping the camera prediction results in additional noise for scaled depth prediction, thus worsening results for $\delta_{1}$.

Table 2 and Table 3 display results on the two popular benchmarks NYU [35] and KITTI [18] Eigen-split. UniDepth sets the state of the art in these two benchmarks despite being compared with models trained on the same domain. Importantly, the KITTI Depth Prediction Benchmark, which provides a perfectly fair evaluation, underscores the excellent zero-shot performance of our method and its robustness compared to the current MMDE SotA methods, as UniDepth ranks first on this benchmark at the time of submission, with a 15.5% improvement in $\mathrm{SI}_{\log}$ over the second-best method. Performance disparities are not solely attributed to dataset characteristics, as observed in the comparison with Metric3D and ZeroDepth. Despite being trained on a smaller dataset, UniDepth outperforms both of these methods. In particular, UniDepth improves in $\delta_{1}$ over Metric3D and ZeroDepth by 5.8% and 7.3%, respectively, on NYU (Table 2) and by 1.1% and 9.4%, respectively, on KITTI (Table 3). Moreover, ZoeDepth, which has a capacity similar to our ViT-based approach and is pre-trained on the diverse MiDaS dataset [42], shows limitations in general zero-shot scenarios in Table 1, exhibiting performance comparable to traditional MMDE methods, especially on scale-invariant metrics.

Table 4: Comparison with equivalent training setup. All methods have the same backbone, ConvNext-L [33] and are tested in a zero-shot regime on KITTI Eigen-split and NYU. iDisc and UniDepth are retrained on a strict subset of Metric3D for 500k iterations as in [59].
| Method | KITTI $\delta_1\uparrow$ | KITTI $\mathrm{SI}_{\log}\downarrow$ | KITTI $\mathrm{F}_A\uparrow$ | NYU $\delta_1\uparrow$ | NYU $\mathrm{SI}_{\log}\downarrow$ | NYU $\mathrm{F}_A\uparrow$ |
|---|---|---|---|---|---|---|
| iDisc [41] | 93.4 | 8.36 | 78.0 | 92.1 | $\underline{8.82}$ | 75.0 |
| Metric3D [59] | $\underline{97.5}$ | $\underline{7.28}$ | $\underline{82.9}$ | $\underline{92.6}$ | 9.13 | $\underline{77.8}$ |
| UniDepth | $\mathbf{97.9}$ | $\mathbf{6.66}$ | $\mathbf{83.8}$ | $\mathbf{97.1}$ | $\mathbf{6.69}$ | $\mathbf{84.3}$ |

For the sake of fair comparison, we provide in Table 4 a comparison between Metric3D, iDisc, and UniDepth, where the latter two are retrained on a strict subset of Metric3D's data, accounting for one-quarter of the original Metric3D dataset, with the same framework detailed in Sec. 4.1. The results are two-fold: they demonstrate how UniDepth still surpasses Metric3D with a subsplit of the training set, and how MMDE SotA methods designed for a single domain cannot fully exploit the training diversity. Qualitative results in Fig. 4 emphasize how the method excels in capturing the overall scale and scene complexity in a zero-shot setup.

4.3 Ablation Study

Table 5: Ablations of UniDepth. In-Domain corresponds to the union of the training domain's validation sets, while Out-of-Domain involves the union of zero-shot testing sets. Oracle is the model with provided GT cameras at training and test time. Baseline directly predicts 3D points in Cartesian space, Baseline++ in pseudo-spherical. Full represents the final UniDepth. All models have the same depth and camera module architecture, if any. $\mathrm{ARel}_{C}$ is the mean of elementwise absolute relative error for camera intrinsics. (†): GT camera intrinsics utilized for backprojection. The backbone used is ConvNext-L [33]. Medians and median average deviations over three runs are reported.
| # | Ablation | In-Domain $\delta_1\uparrow$ | In-Domain $\mathrm{SI}_{\log}\downarrow$ | In-Domain $\mathrm{F}_A\uparrow$ | In-Domain $\mathrm{ARel}_C\downarrow$ | Out-of-Domain $\delta_1\uparrow$ | Out-of-Domain $\mathrm{SI}_{\log}\downarrow$ | Out-of-Domain $\mathrm{F}_A\uparrow$ | Out-of-Domain $\mathrm{ARel}_C\downarrow$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Oracle | $89.06\pm0.03$ | $13.15\pm0.02$ | $65.45\pm0.13$ | n/a | $68.11\pm0.17$ | $14.78\pm0.01$ | $57.17\pm0.09$ | n/a |
| 2 | Full | $88.89\pm0.10$ | $13.13\pm0.01$ | $63.52\pm0.08$ | $2.05\pm0.01$ | $57.06\pm1.48$ | $14.83\pm0.04$ | $49.71\pm0.55$ | $13.54\pm0.85$ |
| 3 | – Camera | $87.42\pm0.04$ | $13.49\pm0.08$ | $63.78\pm0.02$ | n/a | $48.38\pm0.97$ | $15.55\pm0.15$ | $45.21\pm0.86$ | n/a |
| 4 | – Spherical | $61.30\pm1.00$ | $19.36\pm0.09$ | $17.89\pm0.11$ | $48.29\pm4.03$ | $37.09\pm1.37$ | $22.49\pm0.16$ | $21.78\pm0.14$ | $87.51\pm11.1$ |
| 5 | – $\mathcal{L}_{\mathrm{con}}$ | $88.53\pm0.07$ | $13.24\pm0.01$ | $60.89\pm0.15$ | $2.65\pm0.06$ | $52.89\pm0.21$ | $14.85\pm0.01$ | $45.17\pm0.32$ | $14.27\pm0.41$ |
| 6 | – Dense | $87.62\pm0.11$ | $13.41\pm0.05$ | $61.33\pm0.54$ | $1.91\pm0.04$ | $55.65\pm0.18$ | $15.04\pm0.04$ | $43.19\pm0.24$ | $16.61\pm0.41$ |
| 7 | – Detach | $88.16\pm0.12$ | $13.48\pm0.06$ | $64.19\pm0.17$ | $0.93\pm0.02$ | $46.60\pm0.25$ | $15.26\pm0.10$ | $43.85\pm2.01$ | $18.99\pm1.00$ |
8 Baseline 77.36±0.22plus-or-minus77.360.2277.36\scriptstyle\pm 0.2277.36 ± 0.22 21.17±0.28plus-or-minus21.170.2821.17\scriptstyle\pm 0.2821.17 ± 0.28 16.29±0.26plus-or-minus16.290.2616.29\scriptstyle\pm 0.2616.29 ± 0.26 n/a 48.19±1.02plus-or-minus48.191.0248.19\scriptstyle\pm 1.0248.19 ± 1.02 23.05±0.45plus-or-minus23.050.4523.05\scriptstyle\pm 0.4523.05 ± 0.45 14.29±0.36plus-or-minus14.290.3614.29\scriptstyle\pm 0.3614.29 ± 0.36 n/a
9 Baseline++ 82.41±0.13plus-or-minus82.410.1382.41\scriptstyle\pm 0.1382.41 ± 0.13 16.31±0.05plus-or-minus16.310.0516.31\scriptstyle\pm 0.0516.31 ± 0.05 41.98±0.12plus-or-minus41.980.1241.98\scriptstyle\pm 0.1241.98 ± 0.12 n/a 51.22±0.35plus-or-minus51.220.3551.22\scriptstyle\pm 0.3551.22 ± 0.35 18.14±0.05plus-or-minus18.140.0518.14\scriptstyle\pm 0.0518.14 ± 0.05 38.27±0.02plus-or-minus38.270.0238.27\scriptstyle\pm 0.0238.27 ± 0.02 n/a

The importance of each component introduced in Sec. 3 is evaluated by ablating the method in Table 5. Unless stated otherwise, all ablations exploit the predicted camera representation. The first distinction involves the Oracle model, which operates under ideal conditions with known camera information during training and testing, addressing a task similar to [21, 59]. On the other hand, Baseline is a straightforward encoder-decoder implementation with an ($x$, $y$, $z$) output, as outlined at the beginning of Sec. 3, while Baseline++ exploits the proposed pseudo-spherical representation. Module architectures are consistent across experiments. The In-Domain columns report testing on validation splits of the training domains, while Out-of-Domain corresponds to zero-shot testing, as detailed in Sec. 4.1. Notably, In-Domain results are more homogeneous than Out-of-Domain ones, which are noisier yet more informative for gauging expected performance in downstream applications and in-the-wild deployment.

Architecture. The Oracle model demonstrates more robust scale-dependent performance during zero-shot testing compared to the Full model, highlighting how the proposed task is inherently more demanding. The Baseline model illustrates an approach to the problem without utilizing external information and lacking a proper design for both internal and output space. This approach yields markedly inferior results for both In-Domain and Out-of-Domain scenarios in terms of depth and 3D reconstruction metrics.

Camera Module. In Table 5, row 3, the benefit of the Camera Module becomes apparent, revealing a substantial disparity in the effect of this module on scale-invariant and scale-dependent metrics for in- and out-of-domain testing. This disparity stems from the absence of prior knowledge of the model regarding scale, impeding its optimal utilization of the diverse training set. Concentrating solely on predicting depth, rather than a complete 3D output, proves advantageous in averting convergence issues during training. This is evident in comparison with methods predicting 3D, either without reliance on camera information (rows 8 and 9) or influenced by intertwined optimization (row 4), as elucidated in Sec. 3. Refraining from relying on the camera also constrains the model’s capacity to recover a multimodal distribution for out-of-domain samples. The lack of a (bootstrapped) prior prevents the depth module from serving as a corrective mechanism based on an initial scale estimation and imposes an unnecessary computational burden, i.e. recovering the depth values from scratch. This limitation is underscored by the marked variability observed for test sets strongly out-of-distribution, such as KITTI, when comparing the utilization or absence of camera information (rows 2 and 3, respectively). In particular, Full achieves 95.2% in $\delta_1$ in KITTI, while “– Camera” obtains 58.9% for the same test set, despite a mere 2% difference between the two versions on nuScenes and DDAD.
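
To make the distinction between scale-invariant and scale-dependent metrics concrete, the following minimal Python sketch compares the two on a toy prediction. The definitions of `delta1` and `si_log` follow the forms commonly used in the depth-estimation literature and are assumed here rather than taken from the evaluation code of this work; the depth values are made up for illustration.

```python
import math

def delta1(pred, gt, thr=1.25):
    """Fraction of pixels whose ratio max(pred/gt, gt/pred) is below thr."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < thr)
    return hits / len(gt)

def si_log(pred, gt):
    """Scale-invariant log error: 100 * std of (log pred - log gt). Assumed form."""
    e = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    mean = sum(e) / len(e)
    var = sum((x - mean) ** 2 for x in e) / len(e)
    return 100.0 * math.sqrt(var)

# Hypothetical metric depths in meters.
gt = [2.0, 5.0, 10.0, 20.0, 40.0]
good = [2.1, 4.8, 10.5, 19.0, 42.0]   # correct scale, small per-pixel errors
scaled = [1.5 * d for d in good]      # same errors, global scale off by 50%

print("good   :", delta1(good, gt), si_log(good, gt))
print("scaled :", delta1(scaled, gt), si_log(scaled, gt))
```

Under these assumptions, the global scale error pushes every ratio past the 1.25 threshold and collapses $\delta_1$, while the scale-invariant log error is unchanged. This mirrors how a missing scale prior can hurt scale-dependent metrics far more than scale-invariant ones.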

Optimization and Output Representation. All ablations employ the same loss $\mathcal{L}_{\lambda MSE}$, but across different output spaces. In row 4, a Cartesian output space is used instead of the pseudo-spherical one from Sec. 3.1, which results in substantially inferior performance due to the intertwined formulation of camera and depth output spaces. The Baseline (row 8) also employs a Cartesian representation, but the negative impact of this choice is less pronounced in this model because of the absence of a camera module. More specifically, the decoder of Baseline is not conditioned on inaccurate prior camera and scale information as in row 4. Moreover, row 9 corresponds to Baseline with the pseudo-spherical representation. The comparison between rows 8 and 9 shows that, when directly predicting the 3D outputs, the choice of output representation is still relevant in defining a better internal representation and optimization. Row 5 demonstrates the positive impact of the geometric invariance loss. This loss contributes to enhanced in-domain and out-of-domain performance by promoting the invariance of depth features to appearance variations owing to different camera intrinsics. Furthermore, stopping the gradient from propagating from the Camera Module to the Encoder (ablated in row 7), as described in Sec. 3.4, proves particularly beneficial in avoiding scale and camera overfitting in zero-shot testing, and stabilizes the training. The more stable training is obtained by limiting the dominant effect that camera supervision has on the gradient of the Encoder weights compared to depth supervision.
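
The intertwined formulation of a Cartesian output can be illustrated with a small numerical sketch. It assumes a pinhole camera and one particular angle convention (azimuth measured in the x-z plane, elevation towards y); the intrinsics, the pixel, and the function names are hypothetical and not taken from Sec. 3.1.

```python
import math

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with z-depth z to a 3D point."""
    return ((u - cx) / fx * z, (v - cy) / fy * z, z)

def to_pseudo_spherical(x, y, z):
    """Map (x, y, z) to (azimuth, elevation, depth); angle convention is assumed."""
    azimuth = math.atan2(x, z)
    elevation = math.atan2(y, math.hypot(x, z))
    return (azimuth, elevation, z)

# Hypothetical intrinsics and pixel, chosen only for illustration.
fx, fy, cx, cy = 720.0, 720.0, 640.0, 360.0
u, v = 1000.0, 500.0
z_true, z_pred = 10.0, 12.0  # the prediction overestimates depth by 2 m

p_true = backproject(u, v, z_true, fx, fy, cx, cy)
p_pred = backproject(u, v, z_pred, fx, fy, cx, cy)

# Per-component absolute error in each output space.
cartesian_error = [abs(a - b) for a, b in zip(p_pred, p_true)]
spherical_error = [
    abs(a - b)
    for a, b in zip(to_pseudo_spherical(*p_pred), to_pseudo_spherical(*p_true))
]

print("Cartesian error (x, y, z):", cartesian_error)
print("Pseudo-spherical error (azimuth, elevation, z):", spherical_error)
```

In this sketch a pure depth error shows up in the x and y components of the Cartesian output, which also depend on the camera, whereas the two angular components are unaffected. A loss applied per component in the pseudo-spherical space would therefore send no depth-related error signal to the angular outputs, which is one way to read the disentanglement discussed above.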

Camera Representation. In row 6, the model incorporates a sparse camera representation, specifically the pinhole camera model with ($f_x$, $f_y$, $c_x$, $c_y$), leading to sparse camera prompting and scalar supervision; the camera module still predicts the residual components as outlined in Sec. 3.2. This approach hurts generalization, as evidenced by $\mathrm{ARel}_C$ in the out-of-domain evaluation, despite the slight improvement in in-domain $\mathrm{ARel}_C$. We speculate that the four prompts convey less robust information to the depth module than their dense counterpart, resulting in inferior performance for depth metrics compared to Full for both in- and out-of-domain.
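
The difference between the sparse and the dense prompt can be sketched as follows. The code expands four pinhole scalars into a per-pixel grid of ray angles; the angle convention, the function name `dense_camera_map`, and the tiny image size are assumptions made for illustration and are not the definitions of Sec. 3.2.

```python
import math

def dense_camera_map(height, width, fx, fy, cx, cy):
    """Return a height x width grid of (azimuth, elevation) angles, one per pixel."""
    grid = []
    for v in range(height):
        row = []
        for u in range(width):
            rx = (u - cx) / fx  # ray direction x component (z component is 1)
            ry = (v - cy) / fy  # ray direction y component
            azimuth = math.atan2(rx, 1.0)
            elevation = math.atan2(ry, math.hypot(rx, 1.0))
            row.append((azimuth, elevation))
        grid.append(row)
    return grid

# Hypothetical intrinsics (fx, fy, cx, cy) for a 5 x 7 image.
sparse_prompt = (5.0, 5.0, 3.0, 2.0)
dense_prompt = dense_camera_map(5, 7, *sparse_prompt)

n_dense = len(dense_prompt) * len(dense_prompt[0]) * 2
print(len(sparse_prompt), "scalars vs", n_dense, "per-pixel values")
```

The sparse variant conditions the depth module on four numbers, while the dense variant provides a value pair at every pixel, so each spatial location receives its own viewing direction.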

5 Conclusion

In this work, we propose UniDepth to predict metric 3D points in diverse scenes relying solely on a single input image. Through ablation studies, we systematically address the challenges inherent in universal MMDE and quantify the contribution of each component. The designed self-prompting camera module allows camera-free test-time application and renders the model more robust against camera noise. The introduced pseudo-spherical output representation disentangles camera and depth in the optimization process. Furthermore, the proposed geometric invariance loss promotes camera-aware depth consistency. Extensive validation shows that UniDepth sets the new state of the art across multiple benchmarks in a zero-shot regime, even surpassing in-domain trained methods. This attests to the robustness and efficacy of our model and, most importantly, outlines its potential to advance the field of MMDE.

Acknowledgment. This work is funded by Toyota Motor Europe via the research project TRACE-Zürich.

References

  • Antequera et al. [2020] Manuel Lopez Antequera, Pau Gargallo, Markus Hofinger, Samuel Rota Bulò, Yubin Kuang, and Peter Kontschieder. Mapillary planet-scale depth dataset. In The European Conference Computer Vision (ECCV), pages 589–604. Springer International Publishing, 2020.
  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Bhat et al. [2020] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4008–4017, 2020.
  • Bhat et al. [2023] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
  • Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9650–9660, 2021.
  • Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Dai et al. [2017] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
  • Deng et al. [2022] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
  • Dong et al. [2022] Xingshuai Dong, Matthew A Garratt, Sreenatha G Anavatti, and Hussein A Abbass. Towards real-time monocular depth estimation for robotics: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961, 2022.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR). OpenReview.net, 2021.
  • Eftekhar et al. [2021] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10786–10796, 2021.
  • Eigen et al. [2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NeurIPS), pages 2366–2374, 2014.
  • Facil et al. [2019] Jose M Facil, Benjamin Ummenhofer, Huizhong Zhou, Luis Montesano, Thomas Brox, and Javier Civera. Cam-convs: Camera-aware multi-scale convolutions for single-view depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11826–11835, 2019.
  • Fu et al. [2018] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2002–2011, 2018.
  • Garg et al. [2016] Ravi Garg, B. G. Vijay Kumar, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. Lecture Notes in Computer Science, 9912 LNCS:740–756, 2016.
  • Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • Geyer et al. [2020] Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S. Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, Tiffany Fernandez, Martin Jänicke, Sudesh Mirashi, Chiragkumar Savani, Martin Sturm, Oleksandr Vorobiov, Martin Oelker, Sebastian Garreis, and Peter Schuberth. A2D2: Audi Autonomous Driving Dataset. arXiv preprint arXiv:2004.06320, 2020.
  • Guizilini et al. [2020] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Guizilini et al. [2023] Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rareș Ambruș, and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9233–9243, 2023.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016-December:770–778, 2015.
  • Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016.
  • Huang et al. [2016] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition, 2017-January:2261–2269, 2016.
  • Jung et al. [2022] HyunJun Jung, Patrick Ruhkamp, Guangyao Zhai, Nikolas Brasch, Yitong Li, Yannick Verdie, Jifei Song, Yiren Zhou, Anil Armagan, Slobodan Ilic, et al. Is my depth ground-truth good enough? HAMMER – Highly Accurate Multi-Modal dataset for dEnse 3D scene Regression. arXiv preprint arXiv:2205.04565, 2022.
  • Koch et al. [2020] Tobias Koch, Lukas Liebel, Marco Körner, and Friedrich Fraundorfer. Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset. Computer Vision and Image Understanding (CVIU), 191:102877, 2020.
  • Laina et al. [2016] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. Proceedings of the International Conference on 3D Vision (3DV), pages 239–248, 2016.
  • Lee et al. [2019] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. CoRR, abs/1907.10326, 2019.
  • Liu et al. [2023a] Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, and Luc Van Gool. Single image depth prediction made better: A multivariate gaussian take. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17346–17356, 2023a.
  • Liu et al. [2023b] Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, and Luc Van Gool. Va-depthnet: A variational approach to single image depth prediction. arXiv preprint arXiv:2302.06556, 2023b.
  • Liu et al. [2015] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 38:2024–2039, 2015.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, 2021.
  • Liu et al. [2022] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11976–11986, 2022.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. 7th International Conference on Learning Representations, ICLR 2019, 2017.
  • Silberman et al. [2012] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In The European Conference Computer Vision (ECCV), 2012.
  • Nickolls et al. [2008] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda: Is cuda the parallel programming model that application developers have been waiting for? Queue, 6(2):40–53, 2008.
  • Örnek et al. [2022] Evin Pınar Örnek, Shristi Mudgal, Johanna Wald, Yida Wang, Nassir Navab, and Federico Tombari. From 2d to 3d: Re-thinking benchmarking of monocular depth prediction. arXiv preprint arXiv:2203.08122, 2022.
  • Park et al. [2021] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), pages 8024–8035. Curran Associates, Inc., 2019.
  • Patil et al. [2022] Vaishakh Patil, Christos Sakaridis, Alexander Liniger, and Luc Van Gool. P3Depth: Monocular depth estimation with a piecewise planarity prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1600–1611. IEEE, 2022.
  • Piccinelli et al. [2023] Luigi Piccinelli, Christos Sakaridis, and Fisher Yu. iDisc: Internal discretization for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Ranftl et al. [2020] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 44(3):1623–1637, 2020.
  • Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12159–12168, 2021.
  • Schöps et al. [2017] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Shao et al. [2023a] Shuwei Shao, Zhongcai Pei, Weihai Chen, Ran Li, Zhong Liu, and Zhengguo Li. Urcdc-depth: Uncertainty rectified cross-distillation with cutflip for monocular depth estimation. arXiv preprint arXiv:2302.08149, 2023a.
  • Shao et al. [2023b] Shuwei Shao, Zhongcai Pei, Weihai Chen, Xingming Wu, and Zhengguo Li. Nddepth: Normal-distance assisted monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7931–7940, 2023b.
  • Shao et al. [2023c] Shuwei Shao, Zhongcai Pei, Xingming Wu, Zhong Liu, Weihai Chen, and Zhengguo Li. Iebins: Iterative elastic bins for monocular depth estimation. arXiv preprint arXiv:2309.14137, 2023c.
  • Song et al. [2015] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 07-12-June-2015:567–576, 2015.
  • Sun et al. [2020] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2446–2454, 2020.
  • Vasiljevic et al. [2019] Igor Vasiljevic, Nicholas I. Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z. Dai, Andrea F. Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R. Walter, and Gregory Shakhnarovich. DIODE: A dense indoor and outdoor depth dataset. CoRR, abs/1908.00463, 2019.
  • Wang et al. [2019] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8445–8453, 2019.
  • Wang et al. [2020] Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, and Wei Lun Chao. Train in germany, test in the usa: Making 3d object detectors generalize. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11710–11720, 2020.
  • Wilson et al. [2021] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Advances in Neural Information Processing Systems, 2021.
  • Wong et al. [2020] Alex Wong, Xiaohan Fei, Stephanie Tsuei, and Stefano Soatto. Unsupervised depth completion from visual inertial odometry. IEEE Robotics and Automation Letters (RA-L), 5(2):1899–1906, 2020.
  • Xie et al. [2019] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L. Yuille, and Quoc V. Le. Adversarial examples improve image recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 816–825, 2019.
  • Yang et al. [2019] Guorun Yang, Xiao Song, Chaoqin Huang, Zhidong Deng, Jianping Shi, and Bolei Zhou. Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Yang et al. [2021] Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, and Elisa Ricci. Transformer-based attention networks for continuous pixel-wise prediction. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16249–16259, 2021.
  • Yin et al. [2021] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 204–213, 2021.
  • Yin et al. [2023] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9043–9053, 2023.
  • Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2636–2645, 2020.
  • Yuan et al. [2022] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected crfs for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3906–3915. IEEE, 2022.
  • Zamir et al. [2018] Amir R Zamir, Alexander Sax, William B Shen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
  • Zhou et al. [2019] Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. Does computer vision matter for action? Science Robotics, 4, 2019.

Supplementary Material

This supplementary material offers further insights into our work. In Sec. A, we provide results on the official KITTI benchmark and the standard metric evaluation on the KITTI and NYU validation sets. Sec. B includes additional ablations, namely with a ViT backbone and a comparison of different pseudo-spherical representations, and discusses the differences between convolutional and ViT-based backbones regarding generalization. In Sec. C, we describe the datasets used for training and testing and how we propose to amend the artifacts at boundaries present in the ground-truth depth of DIODE [50]. We analyze the complexity of UniDepth and compare it with other methods in Sec. D. Furthermore, Sec. E describes the network architecture in more detail and therefore necessarily overlaps with Sec. 3. Finally, additional visualizations are provided in Sec. F.

A Results

KITTI benchmark [18]. Table 6 shows the performance of UniDepth on the official KITTI private test set, alongside the results of the latest published methods. The table is taken from the official KITTI leaderboard for depth prediction. In particular, UniDepth ranks first in the KITTI benchmark at the time of submission among all methods, published or not.

Table 6: Results on official KITTI [18] Benchmark. Comparison of performance of methods trained on KITTI and tested on the official KITTI private test set.
Method             SI_log   Sq.Rel   A.Rel    iRMS
Lower is better
MG [29]            9.93     1.68 %   7.99 %   10.63
URCDC-Depth [45]   10.03    1.74 %   8.24 %   10.71
iDisc [41]         9.89     1.77 %   8.11 %   10.73
VA-DepthNet [30]   9.84     1.66 %   7.96 %   10.44
IEBins [47]        9.63     1.60 %   7.82 %   10.68
NDDepth [46]       9.62     1.59 %   7.75 %   10.62
UniDepth           8.13     1.09 %   6.54 %   8.24

KITTI Eigen-split and NYUv2-Depth. For the sake of completeness, we report the “standard” metrics in Table 7 and Table 8 on the KITTI Eigen-split and the NYU validation set, respectively. It is worth noting that the typical metrics $\delta_2$ and, especially, $\delta_3$ are saturated and thus not informative. Therefore, we stand by our choice of not reporting them in the main paper and prefer to report $\delta_{0.5}$. Moreover, we suggest that future works use the area under the curve of the $\delta$ metrics as a more informative and comprehensive metric, instead of the values at fixed thresholds, i.e. $\{1.25^i\}_{i=1}^{3}$.
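
The saturation argument and the suggested area-under-the-curve metric can be sketched in a few lines of Python. The threshold form $1.25^i$ comes from the text above; the integration range, the trapezoidal rule, the normalization, and the function names are assumptions of this sketch, not a definition proposed in this work.

```python
def delta_accuracy(pred, gt, exponent):
    """Fraction of pixels with max(pred/gt, gt/pred) < 1.25 ** exponent."""
    thr = 1.25 ** exponent
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < thr)
    return hits / len(gt)

def delta_auc(pred, gt, max_exponent=3.0, steps=300):
    """Trapezoidal area under delta(exponent) on [0, max_exponent], scaled to [0, 1]."""
    h = max_exponent / steps
    vals = [delta_accuracy(pred, gt, i * h) for i in range(steps + 1)]
    area = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return area / max_exponent

# Hypothetical depths: both predictions saturate delta_3, yet differ in quality.
gt = [1.0, 2.0, 4.0, 8.0, 16.0]
pred_a = [1.05 * d for d in gt]  # 5% relative error everywhere
pred_b = [1.30 * d for d in gt]  # 30% relative error everywhere

for name, pred in (("a", pred_a), ("b", pred_b)):
    deltas = [delta_accuracy(pred, gt, e) for e in (0.5, 1, 2, 3)]
    print(name, deltas, round(delta_auc(pred, gt), 3))
```

Both toy predictions reach a perfect score at the loosest threshold, so $\delta_3$ cannot separate them, while the tighter $\delta_{0.5}$ and the area under the curve do.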

Table 7: Comparison on KITTI Eigen-split test set. The first five methods are trained on KITTI and tested on it. The last six methods are tested in a zero-shot setting. UniDepth-{C, V}: UniDepth-{ConvNext [33], ViT [12]}. (†): MiDaS [42] pre-trained. (‡): predicted intrinsics are utilized for conditioning and backprojecting.
Method         δ1                   δ2                   δ3                   F_A                  A.Rel                RMS                  RMS_log              CD                   SI_log
Higher is better: δ1, δ2, δ3, F_A. Lower is better: A.Rel, RMS, RMS_log, CD, SI_log.
BTS [28]       96.2                 99.4                 $\underline{99.8}$   82.0                 5.63                 2.43                 0.089                0.42                 8.18
AdaBins [3]    96.3                 99.5                 $\underline{99.8}$   81.5                 5.85                 2.38                 0.089                0.429                8.10
NeWCRF [61]    97.5                 $\underline{99.7}$   $\mathbf{99.9}$      82.7                 5.20                 2.07                 0.078                0.388                7.00
iDisc [41]     97.5                 $\underline{99.7}$   $\mathbf{99.9}$      83.1                 5.09                 2.07                 0.077                0.380                7.11
ZoeDepth [4]   96.5                 99.1                 99.4                 82.1                 5.76                 2.39                 0.089                0.431                7.47
Metric3D [59]  97.5                 99.5                 $\underline{99.8}$   82.9                 5.33                 2.26                 0.081                0.392                7.28
Ours-C         $\underline{97.8}$   $\underline{99.7}$   $\mathbf{99.9}$      $\underline{83.9}$   $\underline{4.69}$   $\underline{2.00}$   $\underline{0.073}$  $\underline{0.371}$  $\underline{6.72}$
Ours-V         $\mathbf{98.6}$      $\mathbf{99.8}$      $\mathbf{99.9}$      $\mathbf{85.0}$      $\mathbf{4.21}$      $\mathbf{1.75}$      $\mathbf{0.064}$     $\mathbf{0.338}$     $\mathbf{5.84}$
Ours-C ‡       $\mathbf{97.8}$      $\underline{99.7}$   $\mathbf{99.9}$      80.8                 4.77                 2.00                 0.073                0.427                6.72
Ours-V ‡       $\mathbf{98.6}$      $\mathbf{99.8}$      $\mathbf{99.9}$      82.7                 $\mathbf{4.21}$      $\mathbf{1.75}$      $\mathbf{0.064}$     0.381                $\mathbf{5.84}$
Table 8: Comparison on NYU validation set. The first five methods are trained on NYU and tested on it. The last six methods are tested in a zero-shot setting. UniDepth-{C, V}: UniDepth-{ConvNext [33], ViT [12]}. (†): MiDaS [42] pre-trained. (‡): predicted intrinsics are utilized for conditioning and backprojecting.

| Method | $\delta_1\uparrow$ | $\delta_2\uparrow$ | $\delta_3\uparrow$ | $\mathrm{F}_A\uparrow$ | $\mathrm{A.Rel}\downarrow$ | $\mathrm{RMS}\downarrow$ | $\mathrm{Log}_{10}\downarrow$ | $\mathrm{CD}\downarrow$ | $\mathrm{SI}_{\log}\downarrow$ |
|---|---|---|---|---|---|---|---|---|---|
| BTS [28] | 88.5 | $\mathbf{97.8}$ | 99.4 | 74.0 | 10.9 | 0.391 | 0.046 | 0.160 | 11.5 |
| AdaBins [3] | 90.1 | 98.3 | 99.6 | 74.7 | 10.3 | 0.365 | 0.044 | 0.156 | 10.6 |
| NeWCRF [61] | 92.1 | 99.1 | $\underline{99.8}$ | 75.8 | 9.56 | 0.333 | 0.040 | 0.147 | 9.16 |
| iDisc [41] | 93.8 | 99.2 | $\underline{99.8}$ | 78.2 | 8.61 | 0.313 | 0.037 | 0.133 | 8.85 |
| ZoeDepth [4] | 95.2 | 99.5 | $\underline{99.8}$ | 80.1 | 7.70 | 0.278 | 0.033 | 0.125 | 7.19 |
| Metric3D [59] | 92.6 | 97.9 | 99.1 | 77.8 | 9.38 | 0.337 | 0.038 | 0.146 | 9.13 |
| Ours-C | 97.2 | $\underline{99.6}$ | $\mathbf{99.9}$ | 84.4 | 6.22 | 0.231 | 0.026 | 0.101 | 6.39 |
| Ours-V | $\mathbf{98.4}$ | $\underline{99.7}$ | $\mathbf{99.9}$ | $\mathbf{85.9}$ | $\mathbf{5.78}$ | $\mathbf{0.201}$ | $\mathbf{0.024}$ | $\mathbf{0.092}$ | $\mathbf{5.27}$ |
| Ours-C | 97.2 | $\underline{99.6}$ | $\mathbf{99.9}$ | 84.1 | 6.33 | 0.232 | 0.027 | 0.103 | 6.40 |
| Ours-V | $\underline{98.3}$ | $\underline{99.7}$ | $\mathbf{99.9}$ | $\underline{85.5}$ | $\underline{6.04}$ | $\underline{0.205}$ | $\underline{0.025}$ | $\underline{0.094}$ | $\underline{5.28}$ |

B Ablations

Table 9: Ablations of UniDepth. In-Domain corresponds to the union of the training domain’s validation sets, while Out-of-Domain involves the union of zero-shot testing sets. Oracle is the model with provided GT cameras at training and test time. Baseline directly predicts 3D points in Cartesian space, Baseline++ in pseudo-spherical. Full represents the final UniDepth. All models have the same depth and camera module architecture, if any. $\mathrm{ARel}_C$ is the mean of elementwise absolute relative error for camera intrinsics. (†): GT camera intrinsics utilized for backprojection. The backbone used is ViT-L [12]. Medians and median average deviations over three runs are reported.

| | Ablation | In-Domain $\delta_1\uparrow$ | In-Domain $\mathrm{SI}_{\log}\downarrow$ | In-Domain $\mathrm{F}_A\uparrow$ | In-Domain $\mathrm{ARel}_C\downarrow$ | Out-of-Domain $\delta_1\uparrow$ | Out-of-Domain $\mathrm{SI}_{\log}\downarrow$ | Out-of-Domain $\mathrm{F}_A\uparrow$ | Out-of-Domain $\mathrm{ARel}_C\downarrow$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Oracle | 91.46 ± 0.09 | 12.12 ± 0.02 | 68.35 ± 0.14 | n/a | 72.17 ± 0.44 | 13.07 ± 0.01 | 59.84 ± 0.18 | n/a |
| 2 | Full | 91.43 ± 0.05 | 12.06 ± 0.06 | 65.44 ± 0.84 | 2.19 ± 0.14 | 64.45 ± 0.52 | 13.0 ± 0.02 | 52.46 ± 0.29 | 12.31 ± 0.61 |
| 3 | – Camera | 89.33 ± 0.04 | 12.54 ± 0.04 | 66.02 ± 0.27 | n/a | 60.67 ± 0.22 | 13.4 ± 0.07 | 52.43 ± 0.08 | n/a |
| 4 | $\mathcal{L}_{\mathrm{con}}$ | 90.27 ± 0.13 | 12.21 ± 0.01 | 63.28 ± 0.66 | 1.92 ± 0.31 | 61.98 ± 0.41 | 13.24 ± 0.04 | 50.91 ± 0.16 | 13.11 ± 0.36 |
| 5 | – Spherical | 32.92 ± 0.18 | 18.11 ± 0.08 | 33.62 ± 0.07 | 21.64 ± 0.2 | 48.43 ± 1.27 | 18.53 ± 0.35 | 42.85 ± 1.18 | 17.16 ± 0.79 |
| 6 | – Dense | 90.16 ± 0.15 | 12.23 ± 0.01 | 64.19 ± 0.03 | 1.83 ± 0.18 | 62.44 ± 0.19 | 13.36 ± 0.04 | 49.34 ± 0.28 | 13.68 ± 0.61 |
| 7 | – Detach | 89.93 ± 0.02 | 12.58 ± 0.04 | 66.30 ± 0.35 | 0.94 ± 0.03 | 51.77 ± 0.09 | 13.45 ± 0.02 | 49.91 ± 0.01 | 14.87 ± 0.22 |
| 8 | Baseline | 21.26 ± 0.23 | 23.43 ± 0.45 | 29.19 ± 0.09 | n/a | 34.15 ± 0.74 | 20.39 ± 0.42 | 40.14 ± 0.52 | n/a |
| 9 | Baseline++ | 88.84 ± 0.11 | 12.93 ± 0.11 | 42.72 ± 0.10 | n/a | 59.31 ± 0.58 | 14.04 ± 0.03 | 44.12 ± 0.10 | n/a |

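
The "median ± deviation" entries reported over three runs can be reproduced with a few lines. The sketch below assumes that the reported spread is the median absolute deviation around the median; the caption does not define the term precisely, and the function name and the three input values are hypothetical.

```python
from statistics import median


def median_and_mad(values):
    """Return the median and the median absolute deviation of a list of runs."""
    center = median(values)
    spread = median(abs(v - center) for v in values)
    return center, spread


# Hypothetical metric values from three training runs.
runs = [1, 2, 4]
print(median_and_mad(runs))
```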

B.1 Ablations with ViT backbone

Ablations with the ViT backbone are provided in Table 9. The trend is consistent with the one outlined for the convolutional backbone: the ablated components contribute similarly for the ViT-L [12] and ConvNext-L [33] backbones. However, the ViT backbone shows larger variability in out-of-domain results, as well as a stronger effect of the pseudo-spherical representation for both Baseline and Full. The increased susceptibility of the scene’s depth scale to domain shift is also related to the backbone comparison in Table 1. In particular, zero-shot results suggest that the convolutional architecture is more resilient to scale-related domain shifts, while being at a relative disadvantage in handling appearance-related domain shifts. $\mathrm{SI}_{\log}$ consistently favors ViT over convolutional methods, emphasizing the latter’s diminished performance under appearance domain shifts. However, scale-dependent metrics do not consistently favor ViT, suggesting that the constrained receptive field of convolutional methods yields higher robustness to scale-related domain shifts.

B.2 Alternative pseudo-spherical representation

Table 10: Ablations of specific pseudo-spherical representation. In-Domain corresponds to the union of the training domain’s validation sets, while Out-of-Domain involves the union of zero-shot testing sets. All models have the same depth and camera module architecture. $\mathrm{ARel}_C$ is the mean of elementwise absolute relative error for camera intrinsics. Medians and median average deviations over three runs are reported.

| Ablation | Backbone | In-Domain $\delta_1\uparrow$ | In-Domain $\mathrm{SI}_{\log}\downarrow$ | In-Domain $\mathrm{F}_A\uparrow$ | In-Domain $\mathrm{ARel}_C\downarrow$ | Out-of-Domain $\delta_1\uparrow$ | Out-of-Domain $\mathrm{SI}_{\log}\downarrow$ | Out-of-Domain $\mathrm{F}_A\uparrow$ | Out-of-Domain $\mathrm{ARel}_C\downarrow$ |
|---|---|---|---|---|---|---|---|---|---|
| UniDepth | ViT-L [12] | 91.43 ± 0.05 | 12.06 ± 0.06 | 65.44 ± 0.84 | 2.19 ± 0.14 | 64.45 ± 0.52 | 13.00 ± 0.02 | 52.46 ± 0.29 | 12.31 ± 0.61 |
| UniDepth rays | ViT-L [12] | 90.93 ± 0.02 | 12.19 ± 0.06 | 64.70 ± 0.05 | 2.44 ± 0.11 | 65.50 ± 0.81 | 13.03 ± 0.01 | 53.12 ± 0.02 | 11.82 ± 0.99 |
| UniDepth | ConvNext-L [33] | 88.89 ± 0.10 | 13.13 ± 0.01 | 63.52 ± 0.08 | 2.05 ± 0.01 | 57.06 ± 1.48 | 14.83 ± 0.04 | 49.71 ± 0.55 | 13.54 ± 0.85 |
| UniDepth rays | ConvNext-L [33] | 88.55 ± 0.31 | 13.24 ± 0.10 | 62.58 ± 1.11 | 2.74 ± 0.13 | 55.10 ± 0.39 | 14.91 ± 0.01 | 46.38 ± 0.61 | 15.00 ± 0.36 |

Sec. 3 describes the pseudo-spherical representation chosen to disentangle the two sub-tasks, namely calibration and depth estimation, and the ablation studies confirm the effectiveness of this disentanglement. In particular, UniDepth exploits an angular pseudo-spherical representation based on azimuth angle, elevation angle, and log-depth, i.e. ($\theta$, $\phi$, $z_{\log}$). An alternative way to disentangle the two sub-tasks is to exploit the bearing vector and log-depth. More specifically, a bearing vector corresponds to the unit-length ray ($r_x$, $r_y$, $r_z$) $\in \mathbb{S}^2$, with $\mathbb{S}^2$ being the unit-sphere manifold. The bearing vectors are obtained as the unprojection of image coordinates based on the (pinhole) camera model. With this design, the output is represented by the tuple ($r_x$, $r_y$, $r_z$, $z_{\log}$) and the loss $\mathcal{L}_{\lambda\mathrm{MSE}}$ is applied as described in Sec. 3, but with $\lambda_{r_x} = \lambda_{r_y} = \lambda_{r_z} = 1$ and $\lambda_z = 0.15$.

However, the disentanglement into rays and log-depth can be viewed as an alternative pseudo-spherical representation. In fact, rays and angles share a direct relationship: $\theta = \arctan\left(\frac{r_x}{r_z}\right)$ and $\phi = \arccos(r_y)$. Table 10 explores the effectiveness of this alternative representation and compares it to the one presented in Sec. 3. The ablation study highlights that the difference between the two representations is marginal and, in most cases, within the uncertainty range, which supports their similarity. The main difference lies in the output space dimensionality. In principle, the bearing vectors would span the entire $\mathbb{R}^3$ space. However, the space is constrained to the unit-sphere manifold by $\mathrm{L}_2$ normalization.
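
The relationship between bearing vectors and angles can be checked numerically. The sketch below unprojects one pixel with a pinhole model, converts the unit ray to ($\theta$, $\phi$) with the formulas above, and maps the angles back to a ray. The inverse mapping $(\sin\phi\sin\theta, \cos\phi, \sin\phi\cos\theta)$ follows from those two formulas for forward-facing rays ($r_z > 0$); the function names and the intrinsics values are hypothetical.

```python
import math


def unproject(u, v, fx, fy, cx, cy):
    """Pinhole unprojection of pixel (u, v) to a unit-length bearing vector."""
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    norm = math.sqrt(x * x + y * y + z * z)  # L2 normalization onto the unit sphere
    return x / norm, y / norm, z / norm


def ray_to_angles(rx, ry, rz):
    """Azimuth theta = arctan(rx / rz) and elevation phi = arccos(ry)."""
    theta = math.atan2(rx, rz)  # equals arctan(rx / rz) when rz > 0
    phi = math.acos(ry)
    return theta, phi


def angles_to_ray(theta, phi):
    """Inverse mapping, valid for forward-facing rays (rz > 0)."""
    s = math.sin(phi)
    return s * math.sin(theta), math.cos(phi), s * math.cos(theta)


# Hypothetical intrinsics for a 640x480 image.
ray = unproject(100.0, 50.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
theta, phi = ray_to_angles(*ray)
print(ray, angles_to_ray(theta, phi))
```

The round trip recovers the original ray up to floating-point error, which illustrates why the two parameterizations carry the same information: the angular form uses two numbers per pixel, the ray form three numbers plus a unit-norm constraint.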

Furthermore, we ablate our camera prompting with respect to CAMConvs [15] in Table 11.

Table 11: Ablate UniDepth with CAMConvs. Full is complete UniDepth, as row 2 in Tab. 5. w/ CAMConvs represents UniDepth with CAMConvs [15] conditioning instead of our prompting.

| Ablation | In-Domain $\delta_1\uparrow$ | In-Domain $\mathrm{SI}_{\log}\downarrow$ | In-Domain $\mathrm{F}_A\uparrow$ | In-Domain $\mathrm{ARel}_C\downarrow$ | Out-of-Domain $\delta_1\uparrow$ | Out-of-Domain $\mathrm{SI}_{\log}\downarrow$ | Out-of-Domain $\mathrm{F}_A\uparrow$ | Out-of-Domain $\mathrm{ARel}_C\downarrow$ |
|---|---|---|---|---|---|---|---|---|
| w/ CAMConvs | 87.81 | 13.49 | 60.90 | 2.55 | 54.65 | 15.37 | 43.09 | 16.11 |
| Full | $\mathbf{88.89}$ | $\mathbf{13.13}$ | $\mathbf{63.52}$ | $\mathbf{2.05}$ | $\mathbf{57.06}$ | $\mathbf{14.83}$ | $\mathbf{49.71}$ | $\mathbf{13.54}$ |

Algorithm 1 GT depth boundaries refining.
procedure BoundaryRefine($\mathbf{Z}_{\log}$)
     $\mathbf{L} = \mathrm{Laplacian}(\mathbf{Z}_{\log}, k=5)$
     $\mathbf{M} = \mathbb{I}[\mathbf{L}_{10\%} \leq \mathbf{L} \leq \mathbf{L}_{90\%}]$ $\triangleright$ Compute Laplacian and threshold at 10-90 percentile
     $\mathbf{M} = (\mathbf{M} \ominus \mathrm{eye}_3) \oplus \mathrm{eye}_3$ $\triangleright$ Opening with size 3
     $\mathbf{M} = \mathrm{MedianBlur}(\mathbf{M}, k=3)$
     $\mathbf{Z} = \exp(\mathbf{Z}_{\log}) \cdot \mathbf{M}$
     return $\mathbf{Z}$

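Table 12: Datasets List. List of the training and testing datasets: number of images, scene type, and method of acquisition are reported. SfM: Structure-from-Motion. MVS: Multi-View Stereo.

| Split | Dataset | Images | Scene | Acquisition |
|---|---|---|---|---|
| Training Set | A2D2 [19] | 78k | Outdoor | LiDAR |
| Training Set | Argoverse2 [53] | 403k | Outdoor | LiDAR |
| Training Set | BDD100k [60] | 270k | Outdoor | SfM |
| Training Set | CityScapes [7] | 24k | Outdoor | MVS |
| Training Set | DrivingStereo [56] | 63k | Outdoor | MVS |
| Training Set | Mapillary PSD [1] | 742k | Outdoor | SfM |
| Training Set | ScanNet [8] | 83k | Indoor | RGB-D |
| Training Set | Taskonomy [62] | 1940k | Indoor | RGB-D |
| Training Set | Waymo [49] | 223k | Outdoor | LiDAR |
| Testing Set | DDAD [20] | 1002 | Outdoor | LiDAR |
| Testing Set | Diode [50] | 325 | Indoor | LiDAR |
| Testing Set | ETH3D [44] | 454 | Outdoor | RGB-D |
| Testing Set | HAMMER [25] | 496 | Indoor | Mix |
| Testing Set | IBims-1 [26] | 100 | Indoor | RGB-D |
| Testing Set | KITTI [18] | 652 | Outdoor | LiDAR |
| Testing Set | NuScenes [5] | 3k | Outdoor | LiDAR |
| Testing Set | NYU [35] | 654 | Indoor | RGB-D |
| Testing Set | SUN-RGBD [48] | 4.4k | Indoor | RGB-D |
| Testing Set | VOID [54] | 800 | Indoor | RGB-D |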

C Datasets

C.1 Datasets details

Details of the training and testing datasets are presented in Table 12. The training datasets are subsampled such that the interval between two consecutive RGB and GT depth frames is not smaller than one second. We do not apply any post-processing apart from this subsampling. The training set totals 3,743,000 samples. The SUN-RGBD [48] validation set also includes the NYU [35] test set; therefore, we removed the samples corresponding to the NYU test set to avoid any overlap between test sets. As per standard practice, KITTI Eigen-split corresponds to the corrected and accumulated GT depth maps, with 45 images with inaccurate GT discarded from the original 697 images.
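
The temporal subsampling described above can be sketched as a greedy filter over frame timestamps. This is only an illustration under stated assumptions: timestamps are given in seconds and sorted per sequence, and the function name and the 10 Hz example are hypothetical, since the text does not describe the exact procedure.

```python
def subsample_by_time(timestamps, min_gap=1.0):
    """Return indices of kept frames so that consecutive kept frames are
    at least `min_gap` seconds apart (timestamps assumed sorted)."""
    kept = []
    last_kept_time = None
    for index, t in enumerate(timestamps):
        if last_kept_time is None or t - last_kept_time >= min_gap:
            kept.append(index)
            last_kept_time = t
    return kept


# Hypothetical sequence recorded at 10 Hz for 2.5 seconds.
timestamps = [i / 10 for i in range(25)]
print(subsample_by_time(timestamps))
```

For this 10 Hz sequence the filter keeps one frame per second, i.e. indices 0, 10, and 20.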

C.2 Diode Indoor ground-truth correction

Diode [50] ground-truth depth is not perfectly accurate on boundaries: a simple inspection shows that depth at boundaries presents low values, though greater than zero. These artifacts in the GT affect the validation pipeline and results. Therefore, we design a simple image processing algorithm, outlined in Algorithm 1, that first detects the aforementioned boundary artifacts and then masks the depth in the corresponding neighborhoods. By masking those boundaries, the corresponding regions are ignored during validation.
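
A NumPy-only sketch of the steps in Algorithm 1 is given below. It is an approximation under explicit assumptions rather than the authors' implementation: the Laplacian uses a 5-point stencil instead of a kernel with $k=5$, image borders are handled by edge padding, $\mathrm{eye}_3$ is interpreted as a 3×3 identity structuring element, and the median blur of the binary mask is computed as a majority vote over the 3×3 neighborhood. All function and constant names are hypothetical.

```python
import numpy as np

# Offsets covered by a 3x3 identity structuring element (assumed meaning of eye_3).
EYE3_OFFSETS = [(-1, -1), (0, 0), (1, 1)]
BOX3_OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _neighbors(img, offsets):
    """Stack shifted copies of img (edge-padded), one per (dy, dx) offset."""
    h, w = img.shape
    p = np.pad(img, 1, mode="edge")
    return np.stack([p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w] for dy, dx in offsets])


def boundary_refine(z_log):
    """Mask depth around boundary artifacts; masked pixels are set to zero."""
    # Laplacian of log-depth with a 5-point stencil (simplification of k=5).
    p = np.pad(z_log, 1, mode="edge")
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * z_log
    # Keep pixels whose Laplacian lies between the 10th and 90th percentile.
    lo, hi = np.percentile(lap, [10, 90])
    mask = (lap >= lo) & (lap <= hi)
    # Opening: erosion followed by dilation with the same structuring element.
    mask = _neighbors(mask, EYE3_OFFSETS).all(axis=0)
    mask = _neighbors(mask, EYE3_OFFSETS).any(axis=0)
    # A 3x3 median blur of a binary mask is a majority vote over 9 pixels.
    mask = _neighbors(mask, BOX3_OFFSETS).sum(axis=0) >= 5
    return np.exp(z_log) * mask


# Toy log-depth map with a vertical depth discontinuity at column 16.
z_log = np.zeros((32, 32))
z_log[:, 16:] = 2.0
z = boundary_refine(z_log)
print((z == 0).sum(axis=0))  # number of masked pixels per column
```

On this toy input only the two columns adjacent to the discontinuity are masked, while the flat regions keep their metric depth.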

D Model Complexity

Table 13 displays the parameters and inference complexity of UniDepth and other SotA methods. UniDepth with the ViT-L backbone is comparable to ZoeDepth in terms of efficiency and model parameters; however, UniDepth surpasses it in terms of performance, as stated in Sec. 4. Metric3D displays improved efficiency due to its fully convolutional decoder with relatively low dimensionality. It is worth highlighting that ZeroDepth presents low efficiency despite being based on ResNet-18; we argue that this is due to the expensive full-resolution cross-attention in the decoder. The last two rows in Table 13 analyze separately the complexity of the Camera Module and the Depth Module. The Camera Module is a lightweight component accounting for 13.4M parameters. On the other hand, the Depth Module amounts to more than half of the total latency, despite the limited memory consumption. The Depth Module’s high latency is due to the six self-attention layers in the decoder.

Table 13: Parameters and efficiency comparison. Comparison of methods based on latency, throughput, and number of trainable parameters. Tested on an RTX 3090 GPU with 32-bit floating-point precision and input images of size (480, 640). The last two rows correspond to the Camera and Depth Module evaluated independently. R18: ResNet-18 [22], D161: DenseNet-161 [24], EN-B5: EfficientNet-B5-AP [55], CNXT-L: ConvNext-L [33].

| Method | Backbone | Latency (ms) | Throughput (FPS) | Parameters (M) |
|---|---|---|---|---|
| BTS [28] | D161 | 28.5 | 35.1 | 47.0 |
| AdaBins [3] | EN-B5 | 33.2 | 30.1 | 78.3 |
| NeWCRF [61] | Swin-L [32] | 53.1 | 18.8 | 280.0 |
| iDisc [41] | Swin-L [32] | 81.1 | 12.3 | 209.2 |
| ZoeDepth [4] | BEiT-L | 144.8 | 6.91 | 345.9 |
| ZeroDepth [21] | R18 | 955.6 | 1.05 | 232.6 |
| Metric3D [59] | CNXT-L | 40.3 | 24.8 | 203.2 |
| UniDepth | CNXT-L | 86.6 | 11.5 | 238.9 |
| UniDepth | ViT-L [12] | 146.4 | 6.83 | 347.0 |
| Camera Module | - | 5.1 | - | 13.4 |
| Depth Module | - | 49.2 | - | 26.6 |

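
The latency and throughput columns of Table 13 are consistent with single-image inference, where throughput is the reciprocal of latency. The short check below illustrates this relation on three rows of the table; that the authors measured throughput this way is an assumption, and the function name is hypothetical.

```python
def throughput_fps(latency_ms):
    """Frames per second for single-image inference with the given latency."""
    return 1000.0 / latency_ms


# (latency in ms, reported FPS) taken from Table 13.
rows = {
    "BTS": (28.5, 35.1),
    "Metric3D": (40.3, 24.8),
    "UniDepth ViT-L": (146.4, 6.83),
}
for name, (latency_ms, reported_fps) in rows.items():
    print(f"{name}: {throughput_fps(latency_ms):.2f} FPS (reported {reported_fps})")
```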

E Network Architecture

Encoder. We show the effectiveness of our method with different encoders, both convolutional and transformer-based, e.g., ConvNext [33] and ViT [12]. All of them follow the same structure: the feature maps are extracted at each layer, and the feature map corresponding to a “scale” is obtained as the pixel-wise average. For ConvNext, we obtain the class tokens as the average-pooled feature maps. All backbones utilized are originally designed for classification; thus, we remove the last three layers, i.e., the pooling layer, the fully connected layer, and the $\mathrm{softmax}$ layer. The feature maps are flattened, then LayerNorm [2] (LN) and a linear layer are applied. The linear layer projects the features to a common channel dimension of 512. The projected feature maps are interpolated to a common shape, namely $(h, w) = (\frac{H}{16}, \frac{W}{16})$, with $H$ and $W$ the input height and width, respectively. Two independent projections are utilized: one for the feature maps, i.e. $\mathbf{F} \in \mathbb{R}^{h \times w \times C \times B}$, with $B$ corresponding to the four scales and $C$ set to 512 as mentioned above, and one for the class tokens $\in \mathbb{R}^{C \times B}$, the latter fed to the Camera Module only.

Camera Module. The camera parameters are initialized with the four class tokens extracted from the Encoder. The flattened and stacked feature maps from the encoder are detached and used as keys and values in one cross-attention layer, where the queries correspond to the four camera parameters. The output is processed by a Multi-Layer Perceptron (MLP) with one hidden layer of dimension 2048 and Gaussian Error Linear Unit (GELU) [23] activation. Both the cross-attention and the MLP have a residual connection. The four tokens are further processed with two additional self-attention layers, projected to dimension one, and then exponentiated. The camera parameters are obtained as $f_x = \frac{\Delta f_x W}{2}$, $f_y = \frac{\Delta f_y H}{2}$, $c_x = \frac{\Delta c_x W}{2}$, $c_y = \frac{\Delta c_y H}{2}$. The dense camera representation $\mathbf{C}$ is obtained by backprojecting with the predicted camera parameters, $(\mathbf{r}_x, \mathbf{r}_y, \mathbf{r}_z) = \mathbf{K}^{-1}[\mathbf{u}, \mathbf{v}, \mathbf{1}]^T$, and calculating the azimuth and elevation angles, $\theta$ and $\phi$, as in Sec. B. The angular representation is embedded through the Laplace Spherical Harmonics Embedding (SHE) leading to 81 channels, resulting in $\mathbf{E} \in \mathbb{R}^{h \times w \times 81}$.
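
The last steps of the Camera Module, from four scalar outputs to dense angle maps, can be sketched as follows. The sketch assumes a skew-free pinhole matrix $\mathbf{K}$, pixel coordinates taken at pixel centers, and that the $\Delta$ terms are the exponentiated scalar outputs; the function names are hypothetical, and the attention layers and the spherical harmonics embedding are omitted.

```python
import numpy as np


def decode_intrinsics(raw, height, width):
    """Map four raw scalars to (fx, fy, cx, cy): exponentiate, then scale by image size."""
    d_fx, d_fy, d_cx, d_cy = np.exp(raw)
    return d_fx * width / 2, d_fy * height / 2, d_cx * width / 2, d_cy * height / 2


def dense_angles(fx, fy, cx, cy, height, width):
    """Backproject every pixel with K^-1 and return azimuth/elevation maps."""
    # Pixel-center coordinates (the +0.5 offset is an assumption of this sketch).
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    # K^-1 [u, v, 1]^T for a skew-free pinhole camera, then L2 normalization.
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    theta = np.arctan2(rays[..., 0], rays[..., 2])  # azimuth
    phi = np.arccos(rays[..., 1])                   # elevation
    return theta, phi


# Zero raw outputs give unit scale factors: focal lengths and principal point at half the image size.
fx, fy, cx, cy = decode_intrinsics(np.zeros(4), height=4, width=6)
theta, phi = dense_angles(fx, fy, cx, cy, height=4, width=6)
print(theta.shape, phi.shape)
```

Exponentiating the raw outputs keeps all four parameters positive, and scaling by the image size makes the predicted quantities independent of the input resolution.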

Depth Module. The depth latents $\mathbf{L}$ are initialized as the average of the features $\mathbf{F}$ along the $B$ dimension. Then, the latents are conditioned on the original feature tensor $\mathbf{F}$ via one cross-attention layer, where two projections of $\mathbf{F}$ serve as keys and values and $\mathbf{L}$ as queries. In addition, one MLP is applied, as in the Camera Module. Furthermore, the depth features are conditioned on the camera prompts $\mathbf{E}$ with one additional cross-attention layer, where keys and values are two projections of the camera embeddings $\mathbf{E}$, followed by one MLP as above. The features are decoded in three consecutive stages. The first stage applies three self-attention layers with $\mathbf{E}$ as positional encoding. The features are then processed with one ConvNext [33] layer and upsampled by a factor of two, while the channels are halved. The second and third stages are similar, although the second stage has two self-attention layers and the third only one. In the second and third stages, the MLP’s hidden channel dimension is sequentially halved, too, from the initial aforementioned value of 2048. Each stage’s output is projected to dimension one. Then, the three output maps are interpolated to a common shape, i.e. ($\frac{H}{2}$, $\frac{W}{2}$), and pixel-wise averaged. The final log-depth output $\mathbf{Z}_{\log}$ is obtained by upsampling the resulting tensor to the input shape ($H$, $W$). The final depth is the element-wise exponentiation of $\mathbf{Z}_{\log}$.
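
The fusion of the three stage outputs into the final depth map can be sketched as below. The sketch assumes that the stage outputs have resolutions $(\frac{H}{8}, \frac{W}{8})$, $(\frac{H}{4}, \frac{W}{4})$, and $(\frac{H}{2}, \frac{W}{2})$, which follows from upsampling by two at each stage starting from $(\frac{H}{16}, \frac{W}{16})$, that $H$ and $W$ are divisible by 16, and that nearest-neighbor interpolation is used; the interpolation mode is not stated in the text and the function names are hypothetical.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a 2D map by an integer factor."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)


def fuse_log_depth(stage_outputs, height, width):
    """Resize stage outputs to (H/2, W/2), average, upsample to (H, W), exponentiate."""
    target_h = height // 2
    resized = [upsample_nearest(o, target_h // o.shape[0]) for o in stage_outputs]
    z_log = np.mean(resized, axis=0)      # pixel-wise average at (H/2, W/2)
    z_log = upsample_nearest(z_log, 2)    # final log-depth at (H, W)
    assert z_log.shape == (height, width)
    return np.exp(z_log)                  # metric depth


# Toy stage outputs for a 16x16 input, filled with constant log-depths.
H, W = 16, 16
stages = [np.full((H // 8, W // 8), 0.0),
          np.full((H // 4, W // 4), 1.0),
          np.full((H // 2, W // 2), 2.0)]
depth = fuse_log_depth(stages, H, W)
print(depth.shape, depth[0, 0])
```

Averaging in log space and exponentiating at the end guarantees a strictly positive depth at every pixel.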

F Visualization

We provide here twenty more qualitative comparisons, two for each zero-shot test set: KITTI, NYU, Diode, and ETH3D in Fig. 5; DDAD, NuScenes, SUN-RGBD, and IBims-1 in Fig. 6; and VOID and HAMMER in Fig. 7. The error maps are shown after applying median-based rescaling, which was necessary to prevent some error maps from being completely red and thus uninformative. Because their GT is sparse, the DDAD and NuScenes GT and error maps are dilated by a factor of 5 so that they are visible.
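For reference, median-based rescaling followed by the absolute relative error can be sketched as below. The exact procedure used for the figures is not spelled out in the text, so this follows the common form (scale the prediction by the ratio of GT median to prediction median over valid pixels); the function name `median_rescaled_arel` and the validity criterion `gt > 0` are assumptions.

```python
import numpy as np

def median_rescaled_arel(pred, gt):
    """Absolute relative error map after median-based rescaling.

    Pixels without ground truth (assumed to be gt <= 0) are set to NaN.
    """
    valid = gt > 0
    # Align the prediction's median with the GT median over valid pixels.
    scale = np.median(gt[valid]) / np.median(pred[valid])
    err = np.full(gt.shape, np.nan)
    err[valid] = np.abs(scale * pred[valid] - gt[valid]) / gt[valid]
    return scale, err

# Toy example: the prediction is off by a global factor of two.
gt = np.array([[2.0, 4.0],
               [0.0, 8.0]])      # 0 marks a pixel without ground truth
pred = np.array([[1.0, 2.0],
                 [5.0, 4.0]])
scale, err = median_rescaled_arel(pred, gt)
```

Rescaling removes a global scale offset, so the remaining error map reflects structural errors and does not saturate when a method's metric scale is far off.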

[Figure 5 image grid. Rows: KITTI, NYU, Diode, ETH3D. Columns: RGB & GT, ZoeDepth [4], ZeroDepth [21], Metric3D [59], UniDepth, colorbars (Meters | $\mathrm{A.Rel}$).]
Figure 5: Zero-shot qualitative results. Each pair of consecutive rows corresponds to one test sample. Each odd row shows the input RGB image and the absolute relative error map, color-coded with the coolwarm colormap. Each even row shows the GT depth and the predicted depth. The last column shows the colormap ranges for depth and error. (†): KITTI and NYU in the training set.
[Figure 6 image grid. Rows: DDAD, NuScenes, SUN-RGBD, IBims-1. Columns: RGB & GT, ZoeDepth [4], ZeroDepth [21], Metric3D [59], UniDepth, colorbars (Meters | $\mathrm{A.Rel}$).]
Figure 6: Zero-shot qualitative results. Each pair of consecutive rows corresponds to one test sample. Each odd row shows the input RGB image and the absolute relative error map, color-coded with the coolwarm colormap. Each even row shows the GT depth and the predicted depth. The last column shows the colormap ranges for depth and error. (†): DDAD in the training set.
[Figure 7 image grid. Rows: VOID, HAMMER. Columns: RGB & GT, ZoeDepth [4], ZeroDepth [21], Metric3D [59], UniDepth, colorbars (Meters | $\mathrm{A.Rel}$).]
Figure 7: Zero-shot qualitative results. Each pair of consecutive rows corresponds to one test sample. Each odd row shows the input RGB image and the absolute relative error map, color-coded with the coolwarm colormap. Each even row shows the GT depth and the predicted depth. The last column shows the colormap ranges for depth and error.