License: CC BY-NC-ND 4.0
arXiv:2312.05190v1 [cs.CV] 08 Dec 2023

Fine Dense Alignment of Image Bursts
through Camera Pose and Depth Estimation

Bruno Lecouat$^{1,2,\ast}$      Yann Dubois de Mont-Marin$^{2,\ast}$      Théo Bodrito$^{2}$
Julien Mairal$^{4}$      Jean Ponce$^{2,3}$
Authors marked with an asterisk (*) contributed equally to this work. This project started when B. Lecouat was a PhD student at Inria Paris.
Abstract

This paper introduces a novel approach to the fine alignment of images in a burst captured by a handheld camera. In contrast to traditional techniques that estimate two-dimensional transformations between frame pairs or rely on discrete correspondences, the proposed algorithm establishes dense correspondences by optimizing both the camera motion and surface depth and orientation at every pixel. This approach improves alignment, particularly in scenarios with parallax challenges. Extensive experiments with synthetic bursts featuring small and even tiny baselines demonstrate that it outperforms the best optical flow methods available today in this setting, without requiring any training. Beyond enhanced alignment, our method opens avenues for tasks beyond simple image restoration, such as depth estimation and 3D reconstruction, as supported by promising preliminary results. This positions our approach as a versatile tool for various burst image processing applications.

$^{1}$Enhance Lab. $^{2}$Inria and DIENS (ENS-PSL, CNRS, Inria). $^{3}$Courant Institute and Center for Data Science, New York University. $^{4}$Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK. $^{5}$Corresponding author: [email protected]

1 Introduction

This paper tackles the challenge of dense alignment in burst photography, a domain characterized by minimal camera movement and predominantly static scenes. We aim to align these image sequences accurately, quickly, and reliably.

Burst photography is increasingly pivotal in a range of image enhancement applications, as evidenced by recent advancements in high dynamic range imaging [23, 29], night photography [31], deblurring [14], or super-resolution [41, 28, 6]. In this context, a handheld camera captures a rapid sequence of images with slightly different viewpoints due to hand tremor, possibly with varying camera settings, over a brief duration. The alignment of these frames is a critical precursor for these methods. However, current approaches to image registration between image pairs, such as homography or optical flow estimation, do not fully leverage the nature of burst sequences (multiple views of a quasi-static three-dimensional scene with slight camera motion). This limitation potentially leads to suboptimal outcomes. Precision in alignment is crucial for the quality of the enhanced images, and inaccuracies can significantly impair the final results, introducing artifacts like ghosting or zipping [28].

In this paper, instead of relying on traditional pairwise dense alignment of frames, we propose a novel global estimation approach tailored for image bursts, which explicitly considers the three-dimensional nature of the scene. Specifically, our approach takes full advantage of the small baseline feature by introducing a new parametrization of optical flows, consistent across different views, based on the image formation model. This model assumes a perspective camera with known intrinsic parameters, capturing a static scene comprising surfaces approximated as small planar patches. Given the small baseline, we anticipate minimal occlusions between views. Consequently, we simplify the representation of the 3D scene into a concise two-dimensional grid that encodes the depth and normals of these planar surfaces.

Figure 1: The global pipeline of our optimization-based method. It takes as input a burst of images and an initial depth map, and outputs the direct and reverse flows between each image and the first one. Our method estimates the optical flow by using the camera poses and the 3D scene structure as optimization variables for the photometric reprojection error in a reference frame; poses and depth maps can then also be retrieved.

More precisely, our method employs a 2D grid to represent depth, normals from a reference view, and camera poses. While camera poses are optimized individually for each frame, structural parameters are shared across all views. This shared parameterization requires fewer parameters than traditional pairwise optical flow methods. It enhances the overall consistency and effectiveness of our alignment method while still preserving the expressivity necessary for accurately modeling motion induced from 3D scenes.

In practice, we solve a global optimization problem to align frames, minimizing patch photometric reprojection errors across all views within the reference frame while optimizing for camera pose, depth, and normal parameters. In situations with parallax, our model adapts to determine the camera motion and scene geometry that account for the relative movements between frames. When no parallax effects are present, the model defaults to fitting pose parameters for each frame with constant depth, similar to homography fitting.

To achieve efficient optimization, we propose a new coarse-to-fine block-coordinate descent algorithm inspired by the parametric Lucas-Kanade algorithm [5] in its structure, using a variant of the Gauss-Newton algorithm for precise pose optimization on $SO(3)$ and gradient descent for depth and normal adjustments. We also introduce a novel fixed-point algorithm to infer depth maps for new camera positions. This algorithm is particularly advantageous for our specific needs but also holds potential for broader applications. It enables us to calculate reverse optical flows and adapt reference views to other views, which is essential for downstream tasks like super-resolution and low-light photography and can also be used to detect occlusions.

We validate our approach on synthetic bursts built with photorealistic rendering software. To validate it on real-world data, we also demonstrate applications to night photography denoising and super-resolution with real bursts captured with a Pixel 6 Pro smartphone. Quantitative and qualitative experiments with synthetic and real data show that our method gives accurate registration results even when little or no parallax is present, and that it consistently outperforms the state of the art in the burst setting, including learning-based flow estimation methods such as RAFT [40].

Beyond flow estimation, our model demonstrates exceptional versatility and efficiency in small baseline scenarios. It not only achieves convergence in pose and depth to meaningful values but also surpasses specialized methods in these areas. Distinct from conventional 3D methods that typically separate camera pose estimation and dense reconstruction into different steps, our method directly tackles dense optimization within a joint estimation framework.

Essentially, our approach acts as a multifunctional tool in burst photography, with the dual capability to accurately estimate flow and precisely determine depth and pose. It is helpful across a wide range of downstream tasks and sets a new benchmark for processing small motion scenes, combining simplicity, accuracy, and robustness.

Contributions. Below, we summarize our key contributions, highlighting how our approach serves as a versatile tool applicable to various burst image processing tasks:

  1. State-of-the-art dense alignment for burst imagery: we propose a novel optimization algorithm that outperforms deep-learning methods in dense alignment. This precision is especially useful for tasks requiring fine alignments, like burst super-resolution.

  2. Accurate pose and depth estimation in small motion: our algorithm provides state-of-the-art camera pose and depth estimation results in scenarios with minimal motion, effectively capturing 3D scene structures from bursts with small baselines. This performance is achieved where standard SFM methods such as COLMAP [37] struggle.

  3. Novel fixed-point algorithm for depth inference: we propose a new fixed-point algorithm for deducing depth maps at novel camera positions, enhancing our method’s utility in reversing optical flows and warping reference views onto other views, with potential applications beyond the scope of this paper.

2 Related work

Burst photography.

Burst photography is a technique that involves capturing a sequence of images to improve the overall quality of a photograph by reducing noise [23], enhancing details [41, 33, 28, 6, 32, 7, 16], and improving dynamic range [29]. Traditionally, algorithms for burst photography rely on a registration step to align frames [15].

Recent advancements explore machine learning, specifically deep learning, for burst photography, often eliminating the need for traditional registration [32, 16]. However, many such algorithms are based on supervised learning, demanding paired datasets of degraded raw bursts and high-quality sRGB images for training. The reliance on simulated raw bursts generated from ground truth sRGB images introduces a potential mismatch between training and real-world data distributions. Real-world bursts may exhibit different degradations or involve a sensor mismatch, leading to artifacts [8]. Self-supervised learning methods [8, 34] have emerged to address these issues. Furthermore, the computational demands of deep learning models pose challenges for integration into embedded devices [15], limiting their practical utility when resources are limited.

In contrast, we present an efficient approach to image alignment specifically designed for burst image data. This approach does not rely on machine learning and can serve as a versatile tool in various burst processing applications, whether they involve learning-based components or not.

Multi-frame image registration.

A straightforward method for image alignment in burst photography involves aligning frames with a reference frame, as demonstrated in [41, 23]. Some works have explored the multi-view setting to enhance registration quality, such as [3, 19, 4], which introduced various optimization-based approaches for multi-view image registration. However, these approaches are limited to simple motion models, such as translations. In contrast, our method is more general and takes into account the three-dimensional nature of the scene.

Depth reconstruction from small motions.

Popular 3D reconstruction methods rely on geometric approaches such as structure from motion (SfM) [37]. These methods use geometric constraints and depend on keypoint correspondences to reconstruct a sparse 3D scene. Subsequently, dense 3D representations can be estimated based on the sparse reconstruction, as done by COLMAP [37]. Bundle adjustment is a critical step for refining the estimated 3D structure and camera poses of a scene. This process involves optimizing 2D image keypoints, corresponding 3D points, and camera calibration parameters iteratively to minimize the reprojection error, leading to a more accurate scene reconstruction.

Several 3D reconstruction methods have been specifically tailored for scenarios involving small motions to reconstruct depth maps. For instance, Im et al. [24] have adapted SfM to small motion settings, whereas [21] have proposed an efficient method using feature tracking for pairwise key points and bundle adjustment algorithms adapted to small motions. Additionally, this method estimates the intrinsic parameters of the camera as well as distortion parameters to achieve a better fit with the data. In a different approach, [12] introduces a neural depth model and uses an inertial measurement unit (IMU) and lidar measurements to respectively initialize camera poses and the depth map. Then, [11] eliminates the need to initialize with a depth map model, although initialization with such a model may still yield improved results.

In contrast, our method serves a different purpose than depth estimation, with our primary goal being accurate image alignment. As shown in Sec. 4, our dense depth estimation procedure is more suitable for this task than approaches based on bundle adjustment with sparse keypoints.

3 Method

The proposed method aims to robustly and accurately estimate the optical flow and its inverse between a reference image and other images within a burst sequence. Given the nature of a burst, where movements are small, this approach provides the opportunity to directly address the problem densely, in contrast to [21], which relies on prior sparse matching between pairs of views. Densely approaching the problem enables high flow accuracy compared to other existing methods. In order to address the problem both densely and robustly, the key idea of the method is to parameterize the flows for each view using a common dense map characterizing the scene in the reference view and the relative positions of the views with respect to the reference frame.

Our formation model is detailed in the first paragraph below, leading to the formulation of flow estimation by optimizing the dense structure map and the relative positions of the views. This optimization is achieved by minimizing the photometric reprojection error through the direct flow induced by the parameters, which is the loss that best characterizes the quality of the induced flow. The challenges of this minimization problem are outlined in the second paragraph. The minimization procedure uses a block coordinate descent between the dense structure map and the relative poses, described in the third paragraph. This approach stabilizes the optimization process.

It also enables a coarse-to-fine approach for the dense parameterization of the scene. Finally, our formation model also allows the calculation of inverse optical flow through a fixed-point algorithm, detailed in the fourth paragraph. The global pipeline is illustrated in Fig. 1.

Image formation model. We consider a rigid scene described by a piecewise surface and $K+1$ internally calibrated pinhole cameras $(C_k)_{k=0..K}$. A point $\mathbf{u}_i$ in $\mathbf{G}$, a regular grid of the $C_0$ camera plane, is the projection of a point $\mathbf{x}_i$ of the scene surface. We denote by $\pi_i$ the affine plane tangent to the scene at $\mathbf{x}_i$, parameterized by its (non-unit) normal $\mathbf{n}_i$ such that $\pi_i = \{\mathbf{y} \in \mathbb{R}^3, \mathbf{n}_i^\top \mathbf{y} = 1\}$.
A patch $P(\mathbf{u}_i)$ around $\mathbf{u}_i$ is the projection of a patch around $\mathbf{x}_i$ in $\pi_i$, and its image in the camera plane $C_k$ is given by a homography uniquely defined by the plane $\pi_i$ and the extrinsic parameters $\mathbf{R}_k, \mathbf{t}_k$ of the other camera (Fig. 2). For $\mathbf{u}'$ in the patch $P(\mathbf{u}_i)$, we have the direct flow locally expressed as a homography:

$$\hat{H}_{i,k}(\mathbf{u}') = \psi(H_{i,k}[\mathbf{u}', 1]^\top) \tag{1}$$
$$H_{i,k} = \mathbf{R}_k + \mathbf{t}_k \mathbf{n}_i^\top, \tag{2}$$

where $H_{i,k}$ is the homography matrix for the patch $P(\mathbf{u}_i)$ in the camera plane of $C_k$, $[\mathbf{u}', 1]$ is the homogeneous representation of $\mathbf{u}'$, and $\psi : (x, y, z) \mapsto (x/z, y/z)$ is the standard projection. The parameters of this flow are the non-unit normal $\mathbf{n}_i$ characterizing the plane $\pi_i$ and the pose $\mathbf{R}_k, \mathbf{t}_k$. It is important to note, as in [22, 11], that $\mathbf{n}_i$ is not a homogeneous vector defined up to scale and has three full degrees of freedom. The formation model is summed up in Fig. 2.
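As an illustration of Eqs. (1) and (2), the following minimal NumPy sketch builds the plane-induced homography and applies it to a few points of a patch. It assumes points are given in normalized camera coordinates (intrinsics already removed), and the function names `plane_homography` and `direct_flow` are hypothetical helpers introduced here, not part of the paper.

```python
import numpy as np


def plane_homography(R, t, n):
    """Eq. (2): H = R + t n^T for the plane {y : n^T y = 1}."""
    return R + np.outer(t, n)


def direct_flow(H, u):
    """Eq. (1): map 2D points u (shape (N, 2)) through H, then apply psi."""
    u_h = np.hstack([u, np.ones((u.shape[0], 1))])  # homogeneous [u', 1]
    v = u_h @ H.T                                   # each row is H [u', 1]^T
    return v[:, :2] / v[:, 2:3]                     # psi: (x, y, z) -> (x/z, y/z)


# Fronto-parallel plane at depth z = 2, camera translated along x (toy values).
z = 2.0
n = np.array([0.0, 0.0, 1.0 / z])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])
patch = np.array([[0.0, 0.0], [0.01, 0.0], [0.0, 0.01]])
warped = direct_flow(plane_homography(R, t, n), patch)
```

In this toy configuration every point of the patch is shifted by $t_x / z$ along $x$, which is the usual disparity of a fronto-parallel plane under a lateral translation.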

Figure 2: Image formation model with a patch and its local homography flow.

Minimization problem. The parameters of our formation model are $\mathbf{n} = (\mathbf{n}_i)_{i \in \mathbf{G}}$, the dense map over a regular grid $\mathbf{G}$ parametrizing the scene structure, and $\mathbf{R}, \mathbf{t} = (\mathbf{R}_k, \mathbf{t}_k)_{k=1..K}$, the pose parameters of each $C_k$ relative to $C_0$. As the objective is to estimate the optical flow between view $k$ and the reference view, we optimize the parameters $\mathbf{R}, \mathbf{t}, \mathbf{n}$ so that the flows derived from the local homographies $H_{i,k}$ minimize the photometric reprojection error between the images $I_0$ and $I_k$. More specifically, we solve the minimization problem:

$$\min_{\mathbf{n},\mathbf{R},\mathbf{t}} \frac{1}{2} \sum_{k=1}^{K} \sum_{i \in \mathbf{G}} \sum_{\mathbf{u}' \in P(\mathbf{u}_i)} \rho\left(|I_0(\mathbf{u}') - I_k(\hat{H}_{i,k}(\mathbf{u}'))|^2\right), \tag{3}$$

where $\rho$ is a robust loss function as in [39]. A robust loss is needed because the formation model does not account for occlusions. When a pixel $\mathbf{u}'$ in the $C_0$ plane is the projection of a point $\mathbf{x}$ that is not visible in camera $C_k$, $\hat{H}_{i,k}(\mathbf{u}')$ is essentially the projection of another point $\mathbf{y}$ that is not on the same scene element as $\mathbf{x}$. Consequently, it is likely that the color $I_k(\hat{H}_{i,k}(\mathbf{u}'))$ deviates significantly from $I_0(\mathbf{u}')$. The function $\rho$ reduces the importance of large values, effectively filtering out such cases.
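To make the role of $\rho$ concrete, here is a small sketch of the inner sum of Eq. (3) on already-sampled intensities. The exact $\rho$ of [39] is not specified in this section, so a Charbonnier-style penalty is used purely as an assumption. The names `charbonnier`, `charbonnier_weight` and `photometric_loss` are hypothetical.

```python
import numpy as np


def charbonnier(s, eps=0.01):
    """Assumed robust penalty rho, applied to squared residuals s = r^2."""
    return np.sqrt(s + eps**2) - eps


def charbonnier_weight(s, eps=0.01):
    """Derivative rho'(s): the weight a residual gets in a reweighted least-squares step."""
    return 0.5 / np.sqrt(s + eps**2)


def photometric_loss(ref_vals, warped_vals, rho=charbonnier):
    """0.5 * sum of rho(|I_0(u') - I_k(H(u'))|^2) over the sampled pixels."""
    r2 = (np.asarray(ref_vals) - np.asarray(warped_vals)) ** 2
    return 0.5 * np.sum(rho(r2))


ref = np.array([0.50, 0.52, 0.48, 0.51])
aligned = np.array([0.51, 0.52, 0.47, 0.50])    # well-aligned samples
occluded = np.array([0.51, 0.52, 0.47, 0.95])   # last sample lands on another surface

loss_aligned = photometric_loss(ref, aligned)
loss_occluded = photometric_loss(ref, occluded)
weights = charbonnier_weight((ref - occluded) ** 2)
```

With this choice, the weight of the occluded sample is far smaller than that of the inliers, which is the sense in which a robust $\rho$ reduces the importance of large residuals.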

Optimization procedure. As usual in the structure-from-motion literature [22], there is a global scale ambiguity: for every $\lambda > 0$, jointly replacing $\mathbf{t}$ by $\lambda \mathbf{t}$ and $\mathbf{n}$ by $\mathbf{n}/\lambda$ changes neither the homography matrices $H_{i,k}$ nor the loss. To prevent this ambiguity from hindering the convergence of the optimization procedure, our algorithm relies on a block coordinate descent that alternates between steps on the plane map $\mathbf{n}$ and steps on the relative poses $\mathbf{R}, \mathbf{t}$. Indeed, when $\mathbf{n}$ is fixed, there is no longer any ambiguity about the value that $\mathbf{t}$ can take, and the same applies to $\mathbf{n}$ when $\mathbf{t}$ is fixed. Gradually, the ambiguity boils down to the scale induced by the parameters’ initialization. In addition, the optimization problem (3) is not convex, and a good initialization is crucial for the algorithm to converge. In the case of small movements, it is reasonable to initialize the poses with $\mathbf{R}_k = I$ and $\mathbf{t}_k = 0$. It therefore remains to provide a good initialization of the plane parameters. Our method relies on an initialization based on a very coarse, low-resolution estimate of the scene depth in the reference image.
Starting from a depth map $\mathbf{z} = (z_i)_{i \in \mathbf{G}^{(0)}}$ on a very low-resolution grid $\mathbf{G}^{(0)}$ (typically $16 \times 16$), we can initialize the plane map as $\mathbf{n}^{(0)} = ([0, 0, 1/z_i])_{i \in \mathbf{G}^{(0)}}$. This corresponds to initializing the planes as fronto-parallel and located at a distance $z_i$ from the reference camera. It is important to note that this initialization resolves the scale ambiguity and starts the optimization in a good region, thereby avoiding certain local minima. However, it does not need to be extremely precise. As observed in Section 4, the performance of our method is minimally impacted by the quality of the initialization. This paper uses the smallest monocular depth network from [36] at low resolution, which has negligible inference cost, to initialize the algorithm.
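The scale ambiguity and the fronto-parallel initialization can be checked numerically with a short sketch. The random depth map below is only a stand-in for the monocular estimate, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fronto-parallel initialization n_i = [0, 0, 1/z_i] from a coarse 16x16 depth map.
depth = rng.uniform(1.0, 5.0, size=(16, 16))   # stand-in for the monocular estimate
plane_map = np.zeros((16, 16, 3))
plane_map[..., 2] = 1.0 / depth

# A 3D point at depth z_i along the ray of u_i lies on the plane n_i^T y = 1.
u_i = np.array([0.2, -0.1])
x_i = depth[3, 7] * np.array([u_i[0], u_i[1], 1.0])
on_plane = plane_map[3, 7] @ x_i

# Scale ambiguity: (t, n) and (lambda * t, n / lambda) give the same homography.
R = np.eye(3)
t = np.array([0.02, -0.01, 0.005])
n = plane_map[3, 7]
lam = 2.5
H = R + np.outer(t, n)
H_scaled = R + np.outer(lam * t, n / lam)
```

Fixing the plane map from a depth estimate therefore pins down the scale of the translations, consistent with the argument above.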

From this initialization, we adopt a coarse-to-fine strategy as in [30, 39] for optimizing the plane map. Specifically, we define a sequence $\mathbf{G}^{(0)}, \ldots, \mathbf{G}^{(L)}$ of $L+1$ regular grids, each twice as fine as the previous one, with $\mathbf{G}^{(L)}$ having the same resolution as the burst $(I_k)_{k=0..K}$. We also denote by $I_k^{(l)}$ the version of the image $I_k$ downsampled to the resolution of $\mathbf{G}^{(l)}$. Our optimization strategy is as follows:

  • We perform the steps for the poses $\mathbf{R}, \mathbf{t}$ using the high-resolution grid $\mathbf{G}^{(L)}$, a linear interpolation of the current estimate of the plane map $\mathbf{n}^{(l)}$ to the resolution of $\mathbf{G}^{(L)}$, and the high-resolution images $(I_k)_{k=0..K}$. For these steps, we employ a proximal Gauss-Newton algorithm tailored to the fact that rotation matrices belong to the Lie group $SO(3)$ and that the minimization problem (3) is a robust nonlinear least squares problem [25]. The small number of variables (six times the number of images) makes the computation of the required Jacobians tractable. Details about the Gauss-Newton step and the closed form of the Jacobians are provided in Appendix A. Using the Lie group exponential representation of rotations and employing a second-order optimization method are crucial elements of our method for achieving high precision. We empirically show the advantages of these choices in the ablation study presented in Appendix E.

  • We perform the steps on the plane map parameters $\mathbf{n}^{(l)}$ at scale $l$ using the gradient descent variant Adam [26], with the loss calculated using the grid $\mathbf{G}^{(l)}$, the poses $\mathbf{R}, \mathbf{t}$, and the images at resolution $l$: $(I_k^{(l)})_{k=0..K}$. Using a method with moment estimates such as Adam accelerates the convergence of the procedure.

  • After every few alternating steps on $\mathbf{R}, \mathbf{t}$ and on $\mathbf{n}^{(l)}$, we double the resolution of the plane map $\mathbf{n}^{(l)}$ and move to the next scale with $\mathbf{n}^{(l+1)}$.
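The pose steps above act on six variables per image, three for rotation and three for translation. The sketch below shows only how such an increment might be applied with the $SO(3)$ exponential map (Rodrigues' formula). It is not the proximal Gauss-Newton step of Appendix A, and the left-multiplicative convention and the name `update_pose` are assumptions made for illustration.

```python
import numpy as np


def hat(w):
    """Skew-symmetric matrix [w]_x such that hat(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])


def so3_exp(w):
    """Rodrigues' formula: exponential map from a rotation vector to SO(3)."""
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        return np.eye(3) + W + 0.5 * W @ W   # second-order Taylor expansion near zero
    a = np.sin(theta) / theta
    b = (1.0 - np.cos(theta)) / theta**2
    return np.eye(3) + a * W + b * W @ W


def update_pose(R, t, delta):
    """Apply a 6-vector increment (w, dt): rotate on the group, translate additively."""
    w, dt = delta[:3], delta[3:]
    return so3_exp(w) @ R, t + dt


delta = np.array([0.0, 0.0, np.pi / 2, 0.1, 0.0, 0.0])
R_new, t_new = update_pose(np.eye(3), np.zeros(3), delta)
```

Updating through the exponential map keeps the rotation estimate exactly on $SO(3)$, so no re-orthogonalization is needed after a step.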

The procedure is summarized in the pseudocode in Algorithm 1. In the case of a dense approach, a coarse-to-fine strategy is crucial. Since we use a photometric loss, the gradients and Jacobians depend on the spatial gradients of the images $I_k$ and contain only sub-pixel information. When the alignment error is larger than one pixel, this can lead to convergence issues. At lower scales of the coarse-to-fine approach, pixels cover a larger area, so the same sub-pixel information accounts for larger displacements. As we move to higher scales, we increase the precision we aim to achieve. Finally, note that the original minimization problem is properly solved during the last stage of the coarse-to-fine approach. The previous stages can be interpreted as a procedure to generate the right initialization for the original minimization problem.
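The grid sequence and the downsampled images $I_k^{(l)}$ can be sketched as follows. The downsampling filter is not specified in this section, so simple box averaging on square images is assumed here, and `grid_resolutions` and `downsample` are hypothetical helper names.

```python
import numpy as np


def grid_resolutions(coarse, full):
    """Resolutions of G^(0), ..., G^(L): double from `coarse` until `full` is reached."""
    sizes = [coarse]
    while sizes[-1] < full:
        sizes.append(sizes[-1] * 2)
    if sizes[-1] != full:
        raise ValueError("full resolution must equal coarse * 2**L")
    return sizes


def downsample(image, size):
    """Box-filter a square image down to size x size (side must be a multiple of size)."""
    factor = image.shape[0] // size
    return image.reshape(size, factor, size, factor).mean(axis=(1, 3))


image = np.arange(256 * 256, dtype=float).reshape(256, 256)
sizes = grid_resolutions(16, 256)
pyramid = [downsample(image, s) for s in sizes]
```

For a $256 \times 256$ burst and a $16 \times 16$ starting grid, this yields five grids, that is $L = 4$ doublings.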

Algorithm 1 Multiscale block coordinate descent
Input: $L \geq 0$, $N \geq 0$, $\beta$
Input: $I = (I_k)_{k=0..K}$, $d$ ▷ Burst and $16 \times 16$ resolution depth map
$d \leftarrow \operatorname{Mono}(I_0)$ ▷ Low-resolution monocular depth estimation
$n \leftarrow [0, 0, 1/d]$ ▷ Dimension $16 \times 16$
$R \leftarrow I_3$ ▷ $K$ identity matrices of size $3 \times 3$
$t \leftarrow 0$ ▷ $K$ vectors of size $3$
$l \leftarrow 0$
while $l \leq L$ do ▷ Multiscale loop
    $n \leftarrow \operatorname{interpolate}(n, 2 \times \operatorname{resolution}(n))$ ▷ Double resolution
    $I_{-} \leftarrow \operatorname{sample}(I, \operatorname{resolution}(n))$ ▷ Low-resolution image
    $m \leftarrow 0$
    while $m \leq M$ do ▷ Block descent
        $n_{+} \leftarrow \operatorname{interpolate}(n, \operatorname{resolution}(I))$ ▷ High resolution
        $R, t \leftarrow \operatorname{PGN}(R, t, n_{+}, I)$ ▷ Pose Newton step
        $n \leftarrow \operatorname{ADAM}(R, t, n, I_{-})$ ▷ Some steps with ADAM
        $m \leftarrow m + 1$
    end while
    $l \leftarrow l + 1$
end while
return $R, t, n$
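The control flow of Algorithm 1 can be sketched in Python as follows. This is only a skeleton under stated assumptions: `upsample`, `pose_step`, and `plane_step` are hypothetical callables standing in for the interpolation, the PGN pose update, and the ADAM plane update, and `multiscale_schedule` is an illustrative helper that lists the plane-map resolutions visited when starting from the $16 \times 16$ initialization.

```python
from typing import Any, Callable, List, Tuple


def multiscale_schedule(num_scales: int, base_resolution: int = 16) -> List[int]:
    """Plane-map resolutions visited by the multiscale loop.

    The loop runs for l = 0..L (L + 1 passes) and doubles the resolution
    at the start of every pass, starting from the coarse initialization.
    """
    return [base_resolution * 2 ** (l + 1) for l in range(num_scales + 1)]


def multiscale_bcd(
    num_scales: int,            # L in Algorithm 1
    num_block_steps: int,       # M in Algorithm 1
    planes: Any,                # coarse plane map n
    poses: Any,                 # camera poses (R, t)
    full_resolution: int,       # resolution of the burst images
    upsample: Callable,         # hypothetical: (planes, resolution) -> planes
    pose_step: Callable,        # hypothetical: (poses, planes_full) -> poses
    plane_step: Callable,       # hypothetical: (poses, planes, resolution) -> planes
    base_resolution: int = 16,
) -> Tuple[Any, Any]:
    resolution = base_resolution
    for _ in range(num_scales + 1):              # while l <= L
        resolution *= 2
        planes = upsample(planes, resolution)    # double the plane-map resolution
        for _ in range(num_block_steps + 1):     # while m <= M
            planes_full = upsample(planes, full_resolution)
            poses = pose_step(poses, planes_full)            # pose block
            planes = plane_step(poses, planes, resolution)   # plane block
    return poses, planes


if __name__ == "__main__":
    # With L = 4, the last scale reaches 512, the frame size of the synthetic bursts.
    print(multiscale_schedule(4))
```

Passing the update rules as callables keeps the sketch independent of any particular optimizer implementation; only the loop structure is taken from the pseudocode.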

Outputting the flows, poses, depth map and normal map.

Figure 3: We have the depth map $(z^{(0)}_i)_{i\in\mathbf{G}^{(L)}}$ in the reference view $(C_0)$; with interpolation, we construct $Z$, which gives the depth for any $\mathbf{u}_0$. We initialize a depth map $(z^{(k)}_i)_{i\in\mathbf{G}^{(L)}}$ in view $k$ with a copy of the depth map in view $0$. Then, we use $z^{(k)}$ on the regular grid to induce a direct flow into view $0$ and query the depth $Z(F(\mathbf{u}_i^{(k)}))$. Reprojecting the obtained depth map gives the new iterate of $z^{(k)}$. At convergence, the direct flow induced by $z^{(k)}$ is the reverse flow from view $0$ to view $k$.

After the convergence of the algorithm, we obtain $\mathbf{R},\mathbf{t}$, and $\mathbf{n}^{(L)}$ that minimize the problem (3) on a grid of maximum resolution, thus minimizing the photometric error of the flow. We then obtain an estimation of the direct flow for each image. For $\mathbf{u}_i^{(0)}\in\mathbf{G}^{(L)}$:

$$\mathbf{u}_i^{(k)} = \hat{H}_{i,k}(\mathbf{u}_i^{(0)}) = \psi\left(\mathbf{R}_k \mathbf{u}_i^{(0)} + 1/z_i^{(0)}\, \mathbf{t}_k\right), \tag{4}$$

where, from $\mathbf{n}^{(L)}$, we recover a high-resolution depth map in the reference view: $z_i^{(0)} = 1/([\mathbf{u}_i^{(0)}, 1]^\top \mathbf{n}_i^{(L)})$. Normalizing $\mathbf{n}^{(L)}$ also provides a high-resolution normal map. Finally, our algorithm directly estimates the camera poses $\mathbf{R},\mathbf{t}$. For certain applications, such as super-resolution, we need the inverse flow, i.e., $\mathbf{u}_i^{(0)} = F(\mathbf{u}_i^{(k)})$. Inverting a flow is generally unstable, so PyTorch [35] does not implement the forward warp. As the movements are small for a burst, the depth map in view $C_k$ will be close to that in view $C_0$. Moreover, the depth map in view $C_k$ allows generating a direct flow $\mathbf{u}_i^{(0)} = F(\mathbf{u}_i^{(k)})$ as in (4) using the inverse poses: $[\mathbf{R}_k^\top, -\mathbf{R}_k^\top \mathbf{t}_k]$. So, if the depth map is correct, we should obtain the identity by composing with the initial direct flow. This allows designing a fixed-point algorithm, presented in Fig. 3 and detailed in Appendix B. Pixels for which the fixed point does not converge correspond to occluded pixels; the occlusion masks are presented in Appendix G.
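To make Eq. (4), the inverse pose, and the fixed-point idea of Fig. 3 concrete, here is a minimal single-pixel sketch in plain Python. It rests on several assumptions that are not in the text: pixels are expressed in normalized camera coordinates (intrinsics already removed), the interpolated depth $Z$ is given as a Python callable, and the function names (`warp_point`, `inverse_pose`, `backward_depth`) and the iteration count are illustrative. It is not the procedure of Appendix B, only an illustration of the same composition argument.

```python
import math


def mat_vec(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]


def project(X):
    """psi: perspective projection of a 3D point onto the plane z = 1."""
    return (X[0] / X[2], X[1] / X[2])


def depth_from_plane(u, n):
    """z = 1 / ([u, 1]^T n) for the plane vector n attached to pixel u."""
    return 1.0 / (u[0] * n[0] + u[1] * n[1] + n[2])


def unit_normal(n):
    """Normalizing the plane vector gives the surface normal direction."""
    norm = math.sqrt(sum(c * c for c in n))
    return [c / norm for c in n]


def warp_point(u, z, R, t):
    """Eq. (4) for one pixel: returns the warped pixel and its depth.

    psi(R u + t / z) equals psi(z R u + t) because psi is scale invariant;
    the second form also yields the depth in the target view.
    """
    Ru = mat_vec(R, [u[0], u[1], 1.0])
    X = [z * Ru[i] + t[i] for i in range(3)]
    return project(X), X[2]


def inverse_pose(R, t):
    """[R^T, -R^T t], the pose mapping view k back to the reference view."""
    Rt = transpose(R)
    return Rt, [-c for c in mat_vec(Rt, t)]


def backward_depth(u_k, R, t, Z, num_iters=20):
    """Fixed-point sketch: depth at pixel u_k of view k, given Z in view 0."""
    R_inv, t_inv = inverse_pose(R, t)
    z_k = Z(u_k)                                   # start from a copy of the view-0 depth
    for _ in range(num_iters):
        u_0, _ = warp_point(u_k, z_k, R_inv, t_inv)    # direct flow into view 0
        _, z_k = warp_point(u_0, Z(u_0), R, t)         # reproject the queried depth
    u_0, _ = warp_point(u_k, z_k, R_inv, t_inv)
    return u_0, z_k
```

For a small rotation and translation, the update changes the depth estimate very little from one iteration to the next, which is why a fixed iteration count is enough in this toy setting; a real implementation would operate on whole grids and monitor convergence per pixel.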

| Method | B1 EPE ↓ | B1 RMSE ↓ | B1 NPE1 ↑ | B1 NPE2 ↑ | B1 NPE3 ↑ | B2 EPE ↓ | B2 RMSE ↓ | B2 NPE1 ↑ | B2 NPE2 ↑ | B2 NPE3 ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DfUSMC [21]* | 1.4466 | 2.1723 | 0.5315 | 0.7488 | 0.8477 | 4.1356 | 4.5676 | 0.2267 | 0.4278 | 0.5497 |
| RCVD [27]* | 5.9556 | 7.678 | 0.0957 | 0.2534 | 0.3763 | 0.4007 | 0.5316 | 0.8676 | 0.9825 | 0.9959 |
| Saop [11]* | 9.7262 | 12.5891 | 0.101 | 0.2457 | 0.3402 | 2.0430 | 2.3563 | 0.5684 | 0.7645 | 0.8424 |
| Homography | 2.8102 | 4.7107 | 0.4998 | 0.6627 | 0.7405 | 0.3008 | 0.3772 | 0.9003 | 0.9921 | 0.9982 |
| Farnebäck [17] | 2.6852 | 4.8478 | 0.5299 | 0.6612 | 0.7278 | 2.0892 | 3.8154 | 0.6480 | 0.7296 | 0.7642 |
| RAFT [40] | 0.9013 | 1.5396 | 0.7348 | 0.9069 | 0.9443 | 0.4857 | 0.5765 | 0.8664 | 0.9857 | 0.9963 |
| Ours | 0.7439 | 1.4324 | 0.7841 | 0.9084 | 0.9456 | 0.2321 | 0.2820 | 0.9366 | 0.9972 | 1.0000 |

Table 1: Optical flow errors on Blender 1 (small motion, columns B1) and Blender 2 (micro motion, columns B2). The optical flow was predicted from the extrinsic camera parameters and depth maps for the models marked with an asterisk.

4 Experiments

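| Dataset | Method | Left l2 (m) ↓ | ATE (m) ↓ | Geom (m) ↓ | Biinvrot l2 (deg) ↓ | Abs rel ↓ | Sqr rel ↓ | RMSE ↓ | Delta 1 ↑ | Delta 2 ↑ | Delta 3 ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Blender 1 (small motion) | Colmap [37] | | | | | | | | | | |
| Blender 1 (small motion) | DfUSMC [21] | 0.0117 | 0.0108 | 0.0094 | 0.1948 | 0.2107 | 0.4864 | 0.9683 | 0.7723 | 0.8877 | 0.9409 |
| Blender 1 (small motion) | Saop [11] | 0.0274 | 0.0229 | 0.0204 | 0.6369 | 0.5818 | 1.8768 | 1.7900 | 0.3958 | 0.6009 | 0.7198 |
| Blender 1 (small motion) | RCVD [27] | 0.0168 | 0.0162 | 0.0140 | 0.2158 | 0.3111 | 0.5382 | 1.2368 | 0.5294 | 0.814 | 0.9524 |
| Blender 1 (small motion) | Ours | 0.0066 | 0.0056 | 0.0050 | 0.1806 | 0.1381 | 0.2391 | 0.8688 | 0.8358 | 0.9263 | 0.9761 |
| Blender 2 (micro motion) | Colmap [37] | | | | | | | | | | |
| Blender 2 (micro motion) | DfUSMC [21] | 0.0046 | 0.0026 | 0.0024 | 0.1918 | 0.3093 | 0.9543 | 2.0499 | 0.5722 | 0.7785 | 0.9187 |
| Blender 2 (micro motion) | Saop [11] | 0.0078 | 0.0043 | 0.0040 | 0.2678 | 0.2936 | 0.8326 | 2.0020 | 0.5794 | 0.7976 | 0.9263 |
| Blender 2 (micro motion) | RCVD [27] | 0.0168 | 0.0162 | 0.0140 | 0.2158 | 0.1898 | 0.3492 | 1.3745 | 0.6726 | 0.8816 | 0.9693 |
| Blender 2 (micro motion) | Ours | 0.0022 | 0.0022 | 0.0020 | 0.0245 | 0.1383 | 0.1962 | 1.1521 | 0.7996 | 0.9819 | 0.9983 |

Table 2: Pose and depth error metrics on the two proposed synthetic burst datasets. The first four metric columns are pose errors; the last six are depth errors.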
Figure 4: Burst denoising for night photography on real bursts, exploiting the alignment of various algorithms. Panels: Ref. image, Noisy, Homography, Farnebäck [17], Ours. Left: full image with bounding boxes highlighting the regions of interest. Top row: the background region is misaligned by competing methods. Bottom row: the cup is misaligned by other methods. Homography misaligns the plant as well. Best seen by zooming in on a computer screen.
Figure 5: Depth estimation from a synthetic image burst. Panels: Ref. image, Ground truth, Midas [36], RCVD [27] (first row); Saop [11], DfUSMC [21], Ours, Ours + reg (second row). It is one of the scenes generated with Blender used in the dataset Blender 2. We present our result without regularization (Ours) and with determinant penalization (Ours + reg) for smoother results; see Appendix C.

We conduct experiments on synthetic bursts and showcase practical applications using real bursts captured with a Pixel 6 Pro smartphone. These applications include night photography and 3D reconstruction, serving as proof of concept. Additionally, we have included preliminary experiments on burst super-resolution in Appendix K.

Synthetic burst simulation. To evaluate our approach and competing methods, we require photorealistic bursts with ground-truth depth and camera poses. The public multi-view stereo datasets we are aware of lack the needed characteristics: their scenes are not static, or their frame baselines are too large for our specific use cases.

We generated two photorealistic synthetic datasets using CYCLES, the path tracing engine of Blender [13]. We used a set of twelve publicly available indoor scenes made by 3D artists, with detailed and varied scene compositions.

Ten scenes come from [1], and two scenes are from [2]. Each burst of the dataset consists of 20 frames, with a resolution of $512 \times 512$ pixels.

We skipped the post-processing denoising step at the end of the rendering to avoid temporal flickering artifacts and mitigated render noise by using a large number of samples (4096). The camera trajectories and orientations are crafted as follows: a few keyframes were positioned manually to outline the global path, and the other keyframes were obtained with Bézier interpolation.
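As an illustration of the keyframe interpolation described above, the following sketch samples 20 camera positions along a single Bézier segment with de Casteljau's algorithm. The control points are made-up values, and the sketch is not Blender's own interpolation code; it only shows how a few hand-placed control points determine the intermediate positions.

```python
def bezier_point(control_points, s):
    """Evaluate a Bezier curve at s in [0, 1] with de Casteljau's algorithm."""
    pts = [list(p) for p in control_points]
    while len(pts) > 1:
        # Repeated linear interpolation between consecutive points.
        pts = [
            [(1 - s) * a + s * b for a, b in zip(p, q)]
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def sample_trajectory(control_points, num_frames=20):
    """Camera positions for num_frames frames, evenly spaced in the parameter s."""
    return [
        bezier_point(control_points, i / (num_frames - 1))
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Hypothetical control points (in meters) for a small hand-shake-like path.
    controls = [(0.0, 0.0, 0.0), (0.01, 0.005, 0.0), (0.02, -0.005, 0.0), (0.03, 0.0, 0.0)]
    for position in sample_trajectory(controls):
        print(position)
```

Note that even spacing in the curve parameter does not give even spacing in arc length, so the simulated camera speed varies slightly along the path.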

We generated two datasets: Blender 1 with small baselines and Blender 2 with micro-baselines. The first dataset exhibits larger parallax effects, while the second dataset has reduced parallax effects. Detailed characteristics of these datasets are provided in Appendix D.

Evaluation on synthetic data. We initialize our algorithm on synthetic data with a $16 \times 16$ coarse depth map using the shallow network from [36]. For the evaluation, we follow the standard practice for evaluating pose, depth, and flow described in [27, 20]. For all the methods, as depth and pose are estimated only up to an unknown scale, we align the predicted depth with the ground truth using median scaling. For pose evaluation, we compute the scale factor as $s = \operatorname*{arg\,min}_{s} \|T - s\hat{T}\|^2$, where $T = \begin{bmatrix}\mathbf{t}_0, \cdots, \mathbf{t}_N\end{bmatrix}$. In addition, we use the canonical left-invariant distance in $SE(3)$, which combines the rotational and translational parts in one quantity; see [10, 42] for details. We report the distance between the ground truth pose and the estimated pose. It reads $d([R,t],[R',t'])^2 = \|t'-t\|_2^2 + \lambda\|\log(R^\top R')\|_2^2$. For $\lambda$, we use the median value of the ground truth depth. $\|\log(R^\top R')\|_2$ is the canonical metric on the set of rotations $SO(3)$ and is also reported independently. Unlike other methods in the literature [27], we choose not to present the relative pose error (RPE), because a good RPE may not correlate with good alignment metrics and because RPE relies on a time-coherent burst. To evaluate the ATE, we did not align the estimated poses with the ground truth poses using a rigid transformation, as is common in the SLAM community. Indeed, our loss (3) and, more generally, the flow are not invariant under a rigid transformation of the poses. As the final goal of our method is alignment, performance evaluation up to a rigid transformation would not be informative.
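The pose scale factor and the distance above can be computed as in the following sketch. The closed form $s = \langle T, \hat{T}\rangle / \langle \hat{T}, \hat{T}\rangle$ follows directly from the least-squares problem. One assumption is made explicit here: $\|\log(R^\top R')\|_2$ is taken to be the norm of the rotation vector, that is, the geodesic angle in radians; if the Frobenius norm of the skew-symmetric matrix is intended instead, the value differs by a factor $\sqrt{2}$. Function names are illustrative.

```python
import math


def translation_scale(T_gt, T_est):
    """Closed-form s = argmin_s ||T - s T_hat||^2 = <T, T_hat> / <T_hat, T_hat>.

    T_gt and T_est are lists of 3D translation vectors.
    """
    num = sum(a * b for tg, te in zip(T_gt, T_est) for a, b in zip(tg, te))
    den = sum(b * b for te in T_est for b in te)
    return num / den


def rotation_angle(R, R_prime):
    """Geodesic angle (radians) between two rotations given as 3x3 lists.

    Uses trace(R^T R') = sum_ij R_ij R'_ij and cos(theta) = (trace - 1) / 2.
    """
    trace = sum(R[i][j] * R_prime[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against round-off
    return math.acos(cos_theta)


def left_invariant_distance(R, t, R_prime, t_prime, lam):
    """d([R, t], [R', t']) with d^2 = ||t' - t||^2 + lam * angle^2."""
    trans_sq = sum((a - b) ** 2 for a, b in zip(t_prime, t))
    return math.sqrt(trans_sq + lam * rotation_angle(R, R_prime) ** 2)
```

In the setting of the text, `lam` would be the median ground truth depth, which puts the rotational term on a length scale comparable to the translational one.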

For optical flow evaluation, we conducted comparisons on our synthetic datasets. We used a state-of-the-art deep optical flow method [40], registering all frames pairwise with the reference frame. Additionally, we employed a standard homography and the Farnebäck optical flow [17] for comparison. Furthermore, we computed optical flow errors for the other competing methods [11, 21, 27] using the camera projection model as in Eq. (4) and their estimated poses and depth maps. Leveraging the assumption of a static scene, our method consistently outperformed [40] in flow accuracy.
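The flow metrics reported in Table 1 are not defined in this section, so the sketch below relies on assumed, commonly used definitions: EPE as the mean Euclidean endpoint error, RMSE as the root mean squared endpoint error, and NPE$n$ as the fraction of pixels whose endpoint error is below $n$ pixels (consistent with the upward arrows in the table). These definitions and the function name are assumptions, not taken from the paper.

```python
import math


def flow_metrics(flow_pred, flow_gt):
    """Flow errors for two lists of (dx, dy) vectors, one entry per pixel."""
    errors = [
        math.hypot(p[0] - g[0], p[1] - g[1])
        for p, g in zip(flow_pred, flow_gt)
    ]
    n = len(errors)
    metrics = {
        "EPE": sum(errors) / n,                              # mean endpoint error
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),   # root mean squared error
    }
    for threshold in (1, 2, 3):
        # Assumed definition: share of pixels with error under `threshold` pixels.
        metrics[f"NPE{threshold}"] = sum(e < threshold for e in errors) / n
    return metrics
```

For methods that output poses and depth instead of flow, `flow_pred` would first be obtained by warping the pixel grid with Eq. (4) and subtracting the grid coordinates.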

We conducted comparisons of our pose and depth estimation method with methods introduced in [27], [11], and [21], utilizing publicly available codebases. To ensure a fair comparison, we initialized the method from [11] with the same depth map as the one we used for our own initialization.

We compare our method with the monocular depth estimation model Midas [36]. However, monocular methods estimate depth up to an affine transformation, whereas flow estimation is not invariant under affine reparameterization. Affine registration is therefore not fully relevant for evaluating the quality of the result, so the performances in Table 2 are obtained after rescaling only. For a fair depth map comparison, we also evaluate our method and others against Midas with an affine registration. Results are presented in Appendix F.
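The difference between rescaling only and affine registration can be sketched as follows. The depth metrics use the definitions commonly adopted in the depth estimation literature (absolute and squared relative error, RMSE, and the share of pixels whose ratio to the ground truth is below $1.25^k$); the paper defers to [27, 20] for its exact protocol, so treat these formulas, and the least-squares affine fit, as assumptions for illustration.

```python
import math
from statistics import median


def median_scale(pred, gt):
    """Rescale the prediction so that its median matches the ground truth median."""
    s = median(gt) / median(pred)
    return [s * p for p in pred]


def affine_align(pred, gt):
    """Least-squares fit gt ~ a * pred + b, then apply it to the prediction."""
    n = len(pred)
    mean_p, mean_g = sum(pred) / n, sum(gt) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    var = sum((p - mean_p) ** 2 for p in pred)
    a = cov / var
    b = mean_g - a * mean_p
    return [a * p + b for p in pred]


def depth_metrics(pred, gt):
    """Depth errors for two flat lists of positive depths."""
    n = len(gt)
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    metrics = {
        "abs_rel": sum(abs(p - g) / g for p, g in zip(pred, gt)) / n,
        "sq_rel": sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n,
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n),
    }
    for k in (1, 2, 3):
        metrics[f"delta{k}"] = sum(r < 1.25 ** k for r in ratios) / n
    return metrics
```

A prediction that is an affine function of the true depth is scored as perfect after `affine_align` but not after `median_scale`, which is the reason the two protocols are reported separately.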

3D reconstruction quality on synthetic data and real bursts.

We qualitatively evaluate our depth reconstructions on synthetic data from our dataset and on real bursts captured with a Pixel 6 Pro smartphone.

Visualizations of reconstructed depth maps are provided in Figure 5. Our depth maps can look noisy on textureless structures. This is expected: our optimization is not well conditioned on uniform surfaces, since small variations in the inferred depth do not affect the reprojection photometric loss. This noise can be mitigated by adding spatial regularization to the scene steps, but at the cost of lower performance in terms of flow and pose metrics on synthetic data. We observed that using no spatial regularization on the plane parameters gives the best image alignment and pose estimation results. We detail the spatial regularization in Appendix C.

For real scenes, we showcase the high-quality depth reconstructions achievable with our method in Figure 7. We input RAW image bursts from the Pixel 6 Pro smartphone, and perform demosaicking using bilinear filtering. We initialize our algorithm with a low-resolution depth map from the phone sensor. We compare our results with depth maps obtained from a monocular method [36], RCVD [27], Saop [11], and DfUSMC [21]. Furthermore, we provide visualizations of reconstructed point clouds in Figure 6.

Figure 6: Top: point cloud reconstructed with DfUSMC [21]. Bottom: point cloud reconstructed with our method. We show, respectively, the left, right, and top views of the two point clouds.

Low-light photography on real bursts.

To demonstrate the robustness and accuracy of our alignment method for downstream tasks, we conducted a low-light photography experiment as a proof of concept. This scenario is challenging as it involves aligning frames with a low signal-to-noise ratio. We captured night bursts using a Pixel 6 Pro smartphone under low light conditions, using a short exposure time and high ISO settings to reduce motion blur. We aligned these frames using our method and competing alignment algorithms, including a simple homography and dense optical flow using the Farnebäck implementation from OpenCV [9].

To reduce noise, we averaged the aligned frames, using a straightforward denoising approach. While our focus was on highlighting the registration quality of our method, it’s worth noting that a more sophisticated fusion algorithm could be employed to enhance image quality and reduce artifacts, as seen in previous works [31, 23].
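The averaging step can be sketched as a per-pixel mean over frames that have already been warped onto the reference grid. For independent zero-mean noise, averaging $K$ frames divides the noise standard deviation by $\sqrt{K}$, which the small helper `residual_std` lets one check empirically. The frame representation (nested lists) and the function names are illustrative assumptions.

```python
import math
import random


def average_frames(aligned_frames):
    """Per-pixel mean of K aligned frames, each a 2D list of the same shape."""
    k = len(aligned_frames)
    height, width = len(aligned_frames[0]), len(aligned_frames[0][0])
    return [
        [sum(frame[y][x] for frame in aligned_frames) / k for x in range(width)]
        for y in range(height)
    ]


def residual_std(image, clean):
    """Root mean squared deviation of an image from a clean reference."""
    diffs = [
        image[y][x] - clean[y][x]
        for y in range(len(clean))
        for x in range(len(clean[0]))
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [[0.5] * 8 for _ in range(8)]
    # 20 synthetic noisy copies of the same (perfectly aligned) image.
    burst = [[[v + rng.gauss(0.0, 0.1) for v in row] for row in clean] for _ in range(20)]
    print("single frame:", residual_std(burst[0], clean))
    print("average of 20:", residual_std(average_frames(burst), clean))
```

Any misalignment between frames turns into blur under this plain mean, which is why the quality of the averaged image is a direct visual indicator of registration accuracy.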

In Figure 4, we provide visual comparisons of our results. We observed that due to the nonplanar nature of the scene, the homography-based approach failed to align objects in the foreground and background, resulting in a blurry appearance in the denoised image. In contrast, the optical flow model exhibited greater flexibility, successfully aligning objects in both the foreground and background. However, some elements, such as the white book in the background or certain patterns on the white cup in the foreground, were still not perfectly aligned.

Figure 7: Depth estimation from a real burst. Panels: Ref. image, Midas [36], RCVD [27] (first row); DfUSMC [21], Ours, Ours + reg (second row). We present our result without regularization (Ours) and with determinant penalization (Ours + reg) for smoother results; see Appendix C for more details.

Depth initialization. Figure 8 shows the impact of the depth map’s initialization on our method’s performance. We gradually increase the variance of a Gaussian random noise added to the $16 \times 16$ initialization depth map and evaluate the performance of our algorithm on our synthetic dataset with various depth, pose, and alignment metrics. This experiment demonstrates that our method is robust to noise on the initialization depth map. The model only requires a noisy estimate to converge to the right solution.
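A perturbation of this kind can be sketched as below. The noise level is passed as a standard deviation, the clamping to a small positive depth is an added safeguard that the text does not mention, and `init_planes` mirrors the initialization $n \leftarrow [0, 0, 1/d]$ of Algorithm 1. All names and default values are illustrative.

```python
import random


def perturb_depth(depth_map, sigma, seed=0, min_depth=1e-3):
    """Add zero-mean Gaussian noise of standard deviation sigma (meters).

    Depths are clamped to min_depth so that 1 / d stays finite; this clamp
    is an assumption of the sketch, not part of the described experiment.
    """
    rng = random.Random(seed)
    return [
        [max(min_depth, d + rng.gauss(0.0, sigma)) for d in row]
        for row in depth_map
    ]


def init_planes(depth_map):
    """Fronto-parallel plane initialization n = [0, 0, 1 / d] per grid cell."""
    return [[[0.0, 0.0, 1.0 / d] for d in row] for row in depth_map]


if __name__ == "__main__":
    coarse_depth = [[2.0] * 16 for _ in range(16)]   # hypothetical 16 x 16 depth map
    for sigma in (0.0, 0.25, 0.5, 1.0):
        noisy = perturb_depth(coarse_depth, sigma)
        planes = init_planes(noisy)
        print(sigma, planes[0][0])
```

Sweeping `sigma` and rerunning the alignment for each value reproduces the structure of the experiment, with the metrics of the previous paragraphs evaluated at each noise level.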

Figure 8: Noise on the initialization depth map. Our method is robust to noise; the performance degrades when the noise’s variance is larger than 1 meter.

5 Conclusion

Our approach offers a comprehensive and versatile solution for burst photography. It excels in accurately estimating flow, depth, and pose, setting a new benchmark for processing small motion scenes. Future enhancements include integrating intrinsic camera parameter estimation like focal length to improve accuracy, refining our model for lens-induced distortions, and exploring more advanced camera models such as thin-lens to account for defocus effects.

Acknowledgments

This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). JM was supported by the ERC grant number 101087696 (APHELAIA project) and by ANR 3IA MIAI@Grenoble Alpes (ANR-19-P3IA-0003). JP was partly supported by the Louis Vuitton/ENS chair in artificial intelligence and a Global Distinguished Professor appointment at the Courant Institute of Mathematical Sciences and the Center for Data Science of New York University.

References

  • arc [a] Evermotion Archinteriors vol.43. https://evermotion.org/shop/show_product/archinteriors-vol-43/12555, a.
  • arc [b] Architecture topics. https://www.youtube.com/watch?v=Gn1biEB5PbQ, b.
  • Aguerrebere et al. [2016] Cecilia Aguerrebere, Mauricio Delbracio, Alberto Bartesaghi, and Guillermo Sapiro. Fundamental limits in multi-image alignment. IEEE Transactions on Signal Processing, 64(21):5707–5722, 2016.
  • Aguerrebere et al. [2018] Cecilia Aguerrebere, Mauricio Delbracio, Alberto Bartesaghi, and Guillermo Sapiro. A practical guide to multi-image alignment. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1927–1931. IEEE, 2018.
  • Baker and Matthews [2004] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56:221–255, 2004.
  • Bhat et al. [2021a] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Deep burst super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9209–9218, 2021a.
  • Bhat et al. [2021b] Goutam Bhat, Martin Danelljan, Fisher Yu, Luc Van Gool, and Radu Timofte. Deep reparametrization of multi-frame super-resolution and denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2460–2470, 2021b.
  • Bhat et al. [2023] Goutam Bhat, Michaël Gharbi, Jiawen Chen, Luc Van Gool, and Zhihao Xia. Self-supervised burst super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10605–10614, 2023.
  • Bradski [2000] Gary Bradski. The OpenCV library. Dr. Dobb’s Journal: Software Tools for the Professional Programmer, 25(11):120–123, 2000.
  • Chirikjian [2009] Gregory S. Chirikjian. Stochastic Models, Information Theory, and Lie Groups, Volume 1. Birkhäuser Boston, 2009.
  • Chugunov et al. [2022a] Ilya Chugunov, Yuxuan Zhang, and Felix Heide. Shakes on a plane: Unsupervised depth estimation from unstabilized photography. arXiv preprint arXiv:2212.12324, 2022a.
  • Chugunov et al. [2022b] Ilya Chugunov, Yuxuan Zhang, Zhihao Xia, Xuaner Zhang, Jiawen Chen, and Felix Heide. The implicit values of a good hand shake: Handheld multi-frame neural depth refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2852–2862, 2022b.
  • Community [2018] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.
  • Delbracio and Sapiro [2015] Mauricio Delbracio and Guillermo Sapiro. Burst deblurring: Removing camera shake through fourier burst accumulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2385–2393, 2015.
  • Delbracio et al. [2021] Mauricio Delbracio, Damien Kelly, Michael S Brown, and Peyman Milanfar. Mobile computational photography: A tour. Annual Review of Vision Science, 7:571–604, 2021.
  • Dudhane et al. [2023] Akshay Dudhane, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Burstormer: Burst image restoration and enhancement transformer. 2023.
  • Farnebäck [2003] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Image Analysis: 13th Scandinavian Conference, SCIA 2003 Halmstad, Sweden, June 29–July 2, 2003 Proceedings 13, pages 363–370. Springer, 2003.
  • Farsiu et al. [2004] Sina Farsiu, M Dirk Robinson, Michael Elad, and Peyman Milanfar. Fast and robust multiframe super resolution. IEEE transactions on image processing, 13(10):1327–1344, 2004.
  • Farsiu et al. [2005] Sina Farsiu, Michael Elad, and Peyman Milanfar. Constrained, globally optimal, multi-frame motion estimation. In IEEE/SP 13th Workshop on Statistical Signal Processing, 2005, pages 1396–1401. IEEE, 2005.
  • Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
  • Ha et al. [2016] Hyowon Ha, Sunghoon Im, Jaesik Park, Hae-Gon Jeon, and In So Kweon. High-quality depth from uncalibrated small motion clip. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pages 5413–5421, 2016.
  • Hartley and Zisserman [2003] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, New York, NY, USA, 2 edition, 2003.
  • Hasinoff et al. [2016] Samuel W Hasinoff, Dillon Sharlet, Ryan Geiss, Andrew Adams, Jonathan T Barron, Florian Kainz, Jiawen Chen, and Marc Levoy. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics (ToG), 35(6):1–12, 2016.
  • Im et al. [2015] Sunghoon Im, Hyowon Ha, Gyeongmin Choe, Hae-Gon Jeon, Kyungdon Joo, and In So Kweon. High quality structure from small motion for rolling shutter cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages 837–845, 2015.
  • Nocedal and Wright [2006] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer New York, 2006.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kopf et al. [2021] Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1611–1621, 2021.
  • Lecouat et al. [2021] Bruno Lecouat, Jean Ponce, and Julien Mairal. Lucas-kanade reloaded: End-to-end super-resolution from raw image bursts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  • Lecouat et al. [2022] Bruno Lecouat, Thomas Eboli, Jean Ponce, and Julien Mairal. High dynamic range and super-resolution from raw image bursts. ACM Transactions on Graphics, 41(4), 2022.
  • Lei and Yang [2009] Cheng Lei and Yee-Hong Yang. Optical flow estimation on coarse-to-fine region-trees using discrete optimization. In 2009 IEEE 12th International Conference on Computer Vision, pages 1562–1569, 2009.
  • Liba et al. [2019] Orly Liba, Kiran Murthy, Yun-Ta Tsai, Tim Brooks, Tianfan Xue, Nikhil Karnad, Qiurui He, Jonathan T Barron, Dillon Sharlet, Ryan Geiss, et al. Handheld mobile photography in very low light. ACM Trans. Graph., 38(6):164–1, 2019.
  • Luo et al. [2022] Ziwei Luo, Youwei Li, Shen Cheng, Lei Yu, Qi Wu, Zhihong Wen, Haoqiang Fan, Jian Sun, and Shuaicheng Liu. Bsrt: Improving burst super-resolution with swin transformer and flow-guided deformable alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 998–1008, 2022.
  • Mehta et al. [2023] Nancy Mehta, Akshay Dudhane, Subrahmanyam Murala, Syed Waqas Zamir, Salman Khan, and Fahad Shahbaz Khan. Gated multi-resolution transfer network for burst restoration and enhancement. arXiv preprint arXiv:2304.06703, 2023.
  • Nguyen et al. [2021] Ngoc Long Nguyen, Jérémy Anger, Axel Davy, Pablo Arias, and Gabriele Facciolo. Self-supervised multi-image super-resolution for push-frame satellite images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1121–1131, 2021.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Ranftl et al. [2020] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
  • Schonberger and Frahm [2016] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
  • Solà et al. [2018] Joan Solà, Jérémie Deray, and Dinesh Atchuthan. A micro lie theory for state estimation in robotics. CoRR, abs/1812.01537, 2018.
  • Sánchez [2016] Javier Sánchez. The Inverse Compositional Algorithm for Parametric Registration. Image Processing On Line, 6:212–232, 2016. https://doi.org/10.5201/ipol.2016.153.
  • Teed and Deng [2020] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
  • Wronski et al. [2019] Bartlomiej Wronski, Ignacio Garcia-Dorado, Manfred Ernst, Damien Kelly, Michael Krainin, Chia-Kai Liang, Marc Levoy, and Peyman Milanfar. Handheld multi-frame super-resolution. ACM Transactions on Graphics (ToG), 38(4):1–18, 2019.
  • Zacur et al. [2014] Ernesto Zacur, Matias Bossa, and Salvador Olmos. Left-invariant riemannian geodesics on spatial transformation groups. SIAM Journal on Imaging Sciences, 7(3):1503–1557, 2014.

Appendix

A Closed-form Jacobian for the Gauss-Newton step

From Eq. (3), we recall that the residual of the robust least-squares problem, for which we have to compute the Jacobian, is the flat vector $\mathbf{r}$ with coordinates indexed by $k=1..K$, $i\in\mathbf{G}$, and $\mathbf{u}'\in P_i$. Its total dimension is denoted $KD$, where $D=\#(\mathbf{G})\#(P)$ is the size of the grid times the size of a patch:

$$r_{k,i,\mathbf{u}'}=I_0(\mathbf{u}')-I_k(\hat{H}_{i,k}(\mathbf{u}')). \tag{5}$$

We denote by $\xi_k$ the twist such that $\mathbf{R}_k,\mathbf{t}_k=\operatorname{Exp}(\xi_k)$. We want to find the Jacobian $\operatorname{J}_{\mathbf{r}}$ with respect to the flat variable vector $\xi=(\xi_k)_{k=1..K}$. Since each twist $\xi_k$ lies in $\mathbb{R}^6$, $\operatorname{J}_{\mathbf{r}}$ has dimensions $KD\times 6K$ and is block diagonal, with $K$ blocks of dimensions $D\times 6$. We denote these blocks by $\operatorname{J}_k$. Using the expression of the homography matrix in (2), the block $\operatorname{J}_k$ has $D$ rows of the form $(\nabla\phi_{k,i,\mathbf{u}'})^\top$, where:

\begin{align}
\phi_{k,i,\mathbf{u}'}(\xi) &= I_k\left(\psi\left((\mathbf{R}+\mathbf{t}\mathbf{n}_i^\top)\mathbf{u}'\right)\right) \tag{6}\\
&= I_k\left(\psi\left(\mathbf{R}\frac{1}{\mathbf{n}_i^\top\mathbf{u}'}\mathbf{u}'+\mathbf{t}\right)\right), \tag{7}
\end{align}

with $\psi(x,y,z)=[x/z,y/z]^\top$ and $\mathbf{R},\mathbf{t}=\operatorname{Exp}(\xi)$. The step from (6) to (7) uses the invariance of $\psi$ to a nonzero rescaling of its argument, since $(\mathbf{R}+\mathbf{t}\mathbf{n}_i^\top)\mathbf{u}'=(\mathbf{n}_i^\top\mathbf{u}')\left(\mathbf{R}\frac{1}{\mathbf{n}_i^\top\mathbf{u}'}\mathbf{u}'+\mathbf{t}\right)$. So if we denote $\mathbf{X}_{i,\mathbf{u}'}=\frac{1}{\mathbf{n}_i^\top\mathbf{u}'}\mathbf{u}'$ in $\mathbb{R}^3$, and $\Lambda_{\mathbf{X}}:SE(3)\rightarrow\mathbb{R}^3$ the map that takes an element $[\mathbf{R},\mathbf{t}]$ of $SE(3)$ and returns its action on $\mathbf{X}\in\mathbb{R}^3$, $\Lambda_{\mathbf{X}}(\mathbf{R},\mathbf{t})=\mathbf{R}\mathbf{X}+\mathbf{t}$, we can write $\phi_{k,i,\mathbf{u}'}$ as a simple composition and compute its gradient with the chain rule:

\begin{align}
\phi_{k,i,\mathbf{u}'} &= I_k\circ\psi\circ\Lambda_{\mathbf{X}_{i,\mathbf{u}'}}\circ\operatorname{Exp} \tag{8}\\
(\nabla\phi_{k,i,\mathbf{u}'})^\top &= \nabla I_k^\top\operatorname{J}_{\psi}\operatorname{J}_{\Lambda_{\mathbf{X}_{i,\mathbf{u}'}}}\operatorname{J}_{\operatorname{Exp}}. \tag{9}
\end{align}

Note that $\Lambda_{\mathbf{X}}$ takes its input on the group $SE(3)$, and $\operatorname{Exp}$ has its output on the same group. However, as described in [38], using the so-called left Jacobian suffices. $\nabla I_k$ is the spatial gradient of the image $I_k$, computed by convolution with a Sobel kernel and evaluated at the required coordinates using bilinear interpolation. The individual Jacobians are reported, dropping indices, in Table 3.

In practice, we exploit the block-diagonal structure of $\operatorname{J}_{\mathbf{r}}$ in our implementation.
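As an illustration of why this structure helps, the following minimal Python sketch solves a Gauss-Newton step block by block and checks that it matches the step obtained from the assembled block-diagonal Jacobian. It is not the authors' implementation: the helper names `solve` and `gauss_newton_step`, the toy block size (3 rows by 2 parameters instead of $D$ rows by 6 twist coordinates), the numerical values, and the omission of robust weights are all assumptions made for the example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[r]) + [b[r]] for r in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gauss_newton_step(J, r):
    """Step delta minimising ||r + J delta||^2 through the normal equations."""
    n = len(J[0])
    JtJ = [[sum(row[a] * row[b] for row in J) for b in range(n)] for a in range(n)]
    Jtr = [sum(row[a] * ri for row, ri in zip(J, r)) for a in range(n)]
    return solve(JtJ, [-v for v in Jtr])


# Toy data: two frames, each with its own Jacobian block and residual block.
J_blocks = [
    [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]],
    [[2.0, 0.5], [1.0, 1.0], [-1.0, 4.0]],
]
r_blocks = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]

# Block-wise solve: one small system per frame.
delta_blocks = [gauss_newton_step(J, r) for J, r in zip(J_blocks, r_blocks)]

# Reference: assemble the full block-diagonal Jacobian and solve once.
n_params = 2
J_full = []
for k, J in enumerate(J_blocks):
    for row in J:
        left = [0.0] * (n_params * k)
        right = [0.0] * (n_params * (len(J_blocks) - k - 1))
        J_full.append(left + row + right)
r_full = [v for r in r_blocks for v in r]
delta_full = gauss_newton_step(J_full, r_full)
```

Because the normal equations decouple, the block-wise route only ever factorizes matrices of the size of one block, which is the practical benefit of the structure noted above.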

| Domains | Function | Jacobian |
|---|---|---|
| $\mathbb{R}^3\rightarrow\mathbb{R}^2$ | $\psi(x,y,z)=[x/z,y/z]^\top$ | $\operatorname{J}_{\psi}=\frac{1}{z}[\operatorname{I}_2 \mid -\psi(x,y,z)]$ |
| $SE(3)\rightarrow\mathbb{R}^3$ | $\Lambda_{\mathbf{X}}(\mathbf{R},\mathbf{t})=\mathbf{R}\mathbf{X}+\mathbf{t}$ | $\operatorname{J}_{\Lambda_{\mathbf{X}}}=[\mathbf{R} \mid -\mathbf{R}[\mathbf{X}]_{\times}]$ |
| $\mathbb{R}^6\rightarrow SE(3)$ | $\operatorname{Exp}(\xi)$ as eq (172) in [38] | $\operatorname{J}_{\operatorname{Exp}}$ as eq (179a) in [38] |

Table 3: Closed form of functions needed to calculate the residual Jacobian. $\operatorname{I}_2$ is the identity matrix of size 2.
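The first two rows of Table 3 can be checked numerically. The sketch below, in plain Python, builds the $2\times 6$ product $\operatorname{J}_{\psi}\operatorname{J}_{\Lambda_{\mathbf{X}}}$ and compares it with central finite differences. Several points are assumptions of this example rather than statements from the paper: the helper names (`jac_psi`, `jac_action`, `perturb`), the test values, and the perturbation convention, taken here as a right perturbation with the translation coordinates first, which is the convention under which $[\mathbf{R} \mid -\mathbf{R}[\mathbf{X}]_{\times}]$ is the derivative of the action. The image gradient and $\operatorname{J}_{\operatorname{Exp}}$ are not reproduced.

```python
import math


def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def skew(v):
    """Cross-product matrix [v]_x."""
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]


def rodrigues(theta):
    """Rotation matrix for a rotation vector theta (Rodrigues formula)."""
    angle = math.sqrt(sum(c * c for c in theta))
    K = skew(theta)
    K2 = mat_mul(K, K)
    a = math.sin(angle) / angle if angle > 1e-12 else 1.0
    b = (1.0 - math.cos(angle)) / angle ** 2 if angle > 1e-12 else 0.5
    return [[float(r == c) + a * K[r][c] + b * K2[r][c] for c in range(3)]
            for r in range(3)]


def act(R, t, X):
    """Lambda_X(R, t) = R X + t."""
    return [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]


def psi(p):
    """Perspective division psi(x, y, z) = [x/z, y/z]."""
    return [p[0] / p[2], p[1] / p[2]]


def jac_psi(p):
    """J_psi = (1/z) [I_2 | -psi(p)], a 2x3 matrix."""
    x, y, z = p
    return [[1.0 / z, 0.0, -x / z ** 2], [0.0, 1.0 / z, -y / z ** 2]]


def jac_action(R, X):
    """J_Lambda = [R | -R [X]_x], a 3x6 matrix (translation part first)."""
    RX = mat_mul(R, skew(X))
    return [R[r] + [-v for v in RX[r]] for r in range(3)]


def perturb(R, t, tau):
    """Right perturbation by tau = (rho, theta); exact when rho or theta is zero."""
    rho, theta = tau[:3], tau[3:]
    R_new = mat_mul(R, rodrigues(theta))
    t_new = [t[r] + sum(R[r][c] * rho[c] for c in range(3)) for r in range(3)]
    return R_new, t_new


# Arbitrary small pose and 3D point, chosen only for the check.
R = rodrigues([0.02, -0.01, 0.03])
t = [0.01, 0.02, -0.005]
X = [0.4, -0.3, 2.0]

analytic = mat_mul(jac_psi(act(R, t, X)), jac_action(R, X))  # 2x6

h = 1e-6
numeric = [[0.0] * 6 for _ in range(2)]
for j in range(6):
    tau = [0.0] * 6
    tau[j] = h
    plus = psi(act(*perturb(R, t, tau), X))
    tau[j] = -h
    minus = psi(act(*perturb(R, t, tau), X))
    for r in range(2):
        numeric[r][j] = (plus[r] - minus[r]) / (2 * h)

max_err = max(abs(analytic[r][j] - numeric[r][j])
              for r in range(2) for j in range(6))
```

Left-multiplying `analytic` by the interpolated image gradient would give the part of a Jacobian row in Eq. (9) that precedes $\operatorname{J}_{\operatorname{Exp}}$.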

B Fixed point algorithm for reverse flow estimation

We have a depth map in the reference view $(z_i^{(0)})_{i\in\mathbf{G}}$, and we denote by $\gamma_i^{(0)}=1/z_i^{(0)}$ the associated disparity.

Given a disparity $\gamma$, a relative pose $\mathbf{R},\mathbf{t}$, and a point $\mathbf{u}$ in the camera plane of the first view, we can compute its image $\bar{u}$, a point in the second camera plane, and $\bar{\gamma}$, the projected disparity in the second view frame:

\begin{align}
\bar{u}\left(\mathbf{u},\gamma,\mathbf{R},\mathbf{t}\right) &= \psi\left(\mathbf{R}[\mathbf{u},1]+\gamma\mathbf{t}\right) \tag{10}\\
\bar{\gamma}\left(\mathbf{u},\gamma,\mathbf{R},\mathbf{t}\right) &= \gamma\,\omega\left(\mathbf{R}[\mathbf{u},1]+\gamma\mathbf{t}\right). \tag{11}
\end{align}
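A minimal Python sketch of Eqs. (10) and (11) is given below. It assumes that $\omega(x,y,z)=1/z$, which is what the geometry implies (a point at depth $1/\gamma$ in the first view has depth $(\mathbf{R}[\mathbf{u},1]+\gamma\mathbf{t})_z/\gamma$ in the second) but is not restated in this appendix. The function name `transfer` and the toy values are hypothetical.

```python
def mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def transfer(u, gamma, R, t):
    """Map a point u = (x, y) with disparity gamma from the first view to the second.

    Returns (u_bar, gamma_bar) following Eqs. (10)-(11), with psi the
    perspective division and omega(x, y, z) = 1/z (assumed definition).
    """
    p = mat_vec(R, [u[0], u[1], 1.0])
    p = [p[j] + gamma * t[j] for j in range(3)]
    u_bar = (p[0] / p[2], p[1] / p[2])  # psi(p)
    gamma_bar = gamma / p[2]            # gamma * omega(p)
    return u_bar, gamma_bar


# Toy check: identity rotation and a unit translation along the optical axis.
R_id = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_z = [0.0, 0.0, 1.0]
u_bar, gamma_bar = transfer((0.2, -0.1), 0.5, R_id, t_z)
```

In this toy case the point at depth 2 ends up at depth 3 in the second view, so its disparity goes from 1/2 to 1/3 and its image coordinates shrink by the factor 2/3.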

In particular, given a regular grid of points $\mathbf{u}_i^{(0)}$ in the reference view and the relative poses $\mathbf{R}_k,\mathbf{t}_k$ of the other views, we have the direct flow:

$$\bar{\mathbf{u}}_i^{(k)}=\bar{u}\left(\mathbf{u}_i^{(0)},\gamma_i^{(0)},\mathbf{R}_k,\mathbf{t}_k\right). \tag{12}$$

Note that $\bar{\mathbf{u}}_i^{(k)}$ is not a regular grid in the view $C_k$; it is the image of a regular grid in the view $C_0$. The direct flow warps $I_k$ into an image $I_k^{(0)}$ in the camera plane of $C_0$. This is called a backward warp. For some applications, however, we also need to warp the image $I_0$ into an image $I_0^{(k)}$ in the view $C_k$. This can be done using the direct flow $\bar{\mathbf{u}}_i^{(k)}$ and a forward warp, but this is known to be numerically unstable. Instead, it can be computed using a backward warp and the reverse flow.
The reverse flow goes the other way around: the regular grid is $\mathbf{u}_i^{(k)}$ in the camera plane of $C_k$, and we want to find its antecedent $\bar{\mathbf{u}}_i^{(0)}$ in the camera plane of $C_0$. The reverse flow can be computed from (10) using the inverse of the relative pose and the disparity map in the view $C_k$. The inverse of the relative pose is the inverse in $SE(3)$, namely $\mathbf{R}_k^\top,-\mathbf{R}_k^\top\mathbf{t}_k$. The disparity map in the view $C_k$, on the other hand, is not known. However, when reprojected with the inverse relative pose, the disparity $\gamma^{(k)}$ at a point $\mathbf{u}^{(k)}$ in the camera plane of $C_k$ must match the known one in $C_0$:

$$\bar{\gamma}\left(\mathbf{u}^{(k)},\gamma^{(k)},\mathbf{R}_k^\top,-\mathbf{R}_k^\top\mathbf{t}_k\right)=\Gamma^{(0)}\left(\bar{u}\left(\mathbf{u}^{(k)},\gamma^{(k)},\mathbf{R}_k^\top,-\mathbf{R}_k^\top\mathbf{t}_k\right)\right), \tag{13}$$

where $\Gamma^{(0)}$ is the disparity function defined on the whole camera plane of $C_0$ by interpolating the depth map $(z_i^{(0)})_{i\in\mathbf{G}}$. This equation states that the reprojection into $C_0$ of the disparity in $C_k$ must match the disparity in $C_0$ evaluated at the position given by the flow induced by the disparity in $C_k$. It is an implicit equation in $\gamma^{(k)}$. Using (11) again, from $C_0$ to $C_k$, it can be converted into a fixed-point equation $\gamma^{(k)}=F\left(\gamma^{(k)}\right)$, with $F$ defined as:

$$F\left(\gamma\right)=\bar{\gamma}\left(\mathbf{u},\Gamma^{(0)}(\mathbf{u}),\mathbf{R}_k,\mathbf{t}_k\right)\quad\text{with}\quad\mathbf{u}=\bar{u}\left(\mathbf{u}^{(k)},\gamma,\mathbf{R}_k^\top,-\mathbf{R}_k^\top\mathbf{t}_k\right).$$

Then we can estimate the disparity map $(\gamma_i^{(k)})_{i\in\mathbf{G}}$ in view $C_k$ on a regular grid $\mathbf{u}_i^{(k)}$ of that view by running a fixed-point algorithm with the function $F$ at every pixel $\mathbf{u}_i^{(k)}$.

We build the sequence $\gamma_{i,m}^{(k)}$ with $m>0$ as:

$$\gamma^{(k)}_{i,m+1}=\bar{\gamma}\left(\mathbf{u}_{i,k,m},\Gamma^{(0)}(\mathbf{u}_{i,k,m}),\mathbf{R}_k,\mathbf{t}_k\right)\quad\text{with}\quad\mathbf{u}_{i,k,m}=\bar{u}\left(\mathbf{u}_i^{(k)},\gamma^{(k)}_{i,m},\mathbf{R}_k^\top,-\mathbf{R}_k^\top\mathbf{t}_k\right). \tag{14}$$

As the motion baseline is small, we initialize the disparity map in view $C_k$ with the one in $C_0$, $\gamma^{(k)}_{i,0}=\gamma^{(0)}_i$, and we can use the composition of the two flows (direct and reverse) as a convergence error:

$$\epsilon_{i,k,m}=\left\|\mathbf{u}_i^{(k)}-\bar{u}\left(\mathbf{u}_{i,k,m},\Gamma^{(0)}(\mathbf{u}_{i,k,m}),\mathbf{R}_k,\mathbf{t}_k\right)\right\|. \tag{15}$$

The values of $i$ and $k$ for which the sequence does not converge correspond to an occlusion, between the views $C_k$ and $C_0$, of the element projected at $\mathbf{u}_i^{(k)}$. We can build an occlusion mask using this convergence criterion. Examples of these masks are available in Appendix G. For the values of $i$ and $k$ for which the sequence does converge, we denote its limit by $\gamma^{(k)}_i$.

Finally, the reverse flow is given by:

$$\bar{\mathbf{u}}_i^{(0)}=\bar{u}\left(\mathbf{u}_i^{(k)},\gamma^{(k)}_i,\mathbf{R}_k^\top,-\mathbf{R}_k^\top\mathbf{t}_k\right). \tag{16}$$
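The fixed-point procedure of this appendix can be sketched for a single pixel as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: $\omega(x,y,z)=1/z$ is assumed, the interpolated disparity function $\Gamma^{(0)}$ is replaced by the analytic disparity of a slanted plane (`plane_disparity`, which is linear in $\mathbf{u}$), the pose and pixel values are arbitrary, and a fixed number of iterations is used instead of a stopping criterion.

```python
import math


def mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def transpose(M):
    return [[M[c][r] for c in range(3)] for r in range(3)]


def transfer(u, gamma, R, t):
    """Eqs. (10)-(11): image point and disparity in the other view."""
    p = mat_vec(R, [u[0], u[1], 1.0])
    p = [p[j] + gamma * t[j] for j in range(3)]
    return (p[0] / p[2], p[1] / p[2]), gamma / p[2]


def reverse_disparity(u_k, disparity_0, R, t, n_iter=20):
    """Fixed-point iteration of Eq. (14) for one pixel u_k of view C_k.

    Returns the disparity estimate in C_k, the matching point in C_0
    (Eq. (16)) and the convergence error of Eq. (15).
    """
    R_inv = transpose(R)
    t_inv = [-x for x in mat_vec(R_inv, t)]
    gamma = disparity_0(u_k)  # initialisation with the reference disparity
    for _step in range(n_iter):
        u_0 = transfer(u_k, gamma, R_inv, t_inv)[0]       # reverse flow guess
        gamma = transfer(u_0, disparity_0(u_0), R, t)[1]  # reprojected disparity
    u_0 = transfer(u_k, gamma, R_inv, t_inv)[0]
    u_back = transfer(u_0, disparity_0(u_0), R, t)[0]
    eps = math.hypot(u_k[0] - u_back[0], u_k[1] - u_back[1])
    return gamma, u_0, eps


def plane_disparity(u):
    """Disparity of a slanted plane seen from C_0 (stand-in for Gamma^(0))."""
    return 0.05 * u[0] - 0.02 * u[1] + 0.5


# Small-baseline pose from C_0 to C_k: slight rotation about the y axis.
a = 0.01
R_k = [[math.cos(a), 0.0, math.sin(a)],
       [0.0, 1.0, 0.0],
       [-math.sin(a), 0.0, math.cos(a)]]
t_k = [0.02, -0.01, 0.005]

gamma_k, u_0, eps = reverse_disparity((0.3, -0.2), plane_disparity, R_k, t_k)
```

In this unoccluded toy scene the iteration contracts quickly because the baseline is small. Thresholding `eps` over all pixels would give one possible form of the occlusion mask described above.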

C Determinant regularization

The idea behind this regularization is that, when the gradient is small, we favor the descent direction for the structure that deforms the current flow the least. To do this, we look at the effect of the flow on the centers of the patches regularly distributed on the grid $\mathbf{G}^{(l)}$. We index the elements of the grid $\mathbf{G}^{(l)}$ by $i=(i_x,i_y)$ with $i_x=1..W_l$ and $i_y=1..H_l$. We denote by $\mathbf{u}_{i_x,i_y}$ the center point of the pixel in the corresponding image plane, and we assume that the coordinates of $\mathbf{u}_{i_x,i_y}$ are normalized to the range $[-1,1]$. 
A parallelogram formed by the points $(\mathbf{u}_{i_x,i_y},\mathbf{u}_{i_x+1,i_y},\mathbf{u}_{i_x+1,i_y+1},\mathbf{u}_{i_x,i_y+1})$ thus has a normalized area of $4/(H_lW_l)$. 
We compare independently, for each view k𝑘kitalic_k and each grid mesh element, the normalized area of the mesh after application of the local homographic flow and the constant area noted 𝐮¯ix,iy(k)=H^(ix,iy),k(𝐮ix,iy)superscriptsubscript¯𝐮subscript𝑖𝑥subscript𝑖𝑦𝑘subscript^𝐻subscript𝑖𝑥subscript𝑖𝑦𝑘subscript𝐮subscript𝑖𝑥subscript𝑖𝑦\bar{\mathbf{u}}_{i_{x},i_{y}}^{(k)}=\hat{H}_{(i_{x},i_{y}),k}(\mathbf{u}_{i_{% x},i_{y}})over¯ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_i start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT ( italic_i start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_i start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) , italic_k end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_i start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_POSTSUBSCRIPT ). We penalize the ratio of the area of each parallelogram before and after the homography flow to 1. The penalization reads:

\begin{align}
P &= \sum_{k=1}^{K} \sum_{i_x=1}^{W_l-1} \sum_{i_y=1}^{H_l-1} \left| \frac{\mathcal{A}_{i_x,i_y}^{(k)}/2}{4/(H_l W_l)} - 1 \right| \tag{17} \\
\mathcal{A}_{i_x,i_y}^{(k)} &= \det\left(\bar{\mathbf{u}}_{i_x+1,i_y}^{(k)} - \bar{\mathbf{u}}_{i_x,i_y}^{(k)},\; \bar{\mathbf{u}}_{i_x,i_y+1}^{(k)} - \bar{\mathbf{u}}_{i_x,i_y}^{(k)}\right) \tag{18} \\
&\quad + \det\left(\bar{\mathbf{u}}_{i_x,i_y+1}^{(k)} - \bar{\mathbf{u}}_{i_x+1,i_y+1}^{(k)},\; \bar{\mathbf{u}}_{i_x+1,i_y}^{(k)} - \bar{\mathbf{u}}_{i_x+1,i_y+1}^{(k)}\right), \tag{19}
\end{align}

where $\mathcal{A}_{i_x,i_y}^{(k)}$ is twice the area of the parallelogram $(\bar{\mathbf{u}}_{i_x,i_y}^{(k)}, \bar{\mathbf{u}}_{i_x+1,i_y}^{(k)}, \bar{\mathbf{u}}_{i_x+1,i_y+1}^{(k)}, \bar{\mathbf{u}}_{i_x,i_y+1}^{(k)})$, computed as the sum of the determinants associated with its two triangular halves, as illustrated in Figure 9.

Figure 9: Illustration of the determinant regularization.
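
As a minimal sketch of Eqs. (17) to (19), the following NumPy function evaluates the penalty from an array of warped grid points. The function name `area_penalty`, the array layout `(K, W, H, 2)`, and the assumption that grid points live in normalized coordinates in $[-1,1]^2$ (so that an undistorted cell has area $4/(H_l W_l)$) are illustrative choices and are not taken from the authors' implementation.

```python
import numpy as np

def area_penalty(warped):
    """Determinant-based area penalty of Eqs. (17)-(19).

    warped: array of shape (K, W, H, 2) holding, for each view k, the warped
            grid points u_bar[k, ix, iy] (assumed layout, normalized coords).
    """
    _, W, H, _ = warped.shape
    u00 = warped[:, :-1, :-1]   # u_bar[ix,   iy]
    u10 = warped[:, 1:, :-1]    # u_bar[ix+1, iy]
    u01 = warped[:, :-1, 1:]    # u_bar[ix,   iy+1]
    u11 = warped[:, 1:, 1:]     # u_bar[ix+1, iy+1]

    def det(a, b):
        # 2x2 determinant of the column vectors a and b
        return a[..., 0] * b[..., 1] - a[..., 1] * b[..., 0]

    # Each determinant is twice the area of one triangular half of the cell,
    # so their sum is twice the area of the whole cell.
    twice_area = det(u10 - u00, u01 - u00) + det(u01 - u11, u10 - u11)
    cell_area = 4.0 / (H * W)   # area of an undistorted cell
    return np.abs((twice_area / 2.0) / cell_area - 1.0).sum()
```

The penalty vanishes for an undistorted grid and grows with the relative change of area of each cell, independently of where the cell is translated.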

D Additional details on the datasets and on the experiments

Proposed datasets.

We generated two datasets: Blender 1 with small baselines and Blender 2 with micro-baselines. The first dataset exhibits larger parallax effects, while the second dataset has reduced parallax effects. Detailed characteristics of these datasets are provided in Table 4.

| Dataset | Scenes | Frames | Std baselines (m) | Std rotations (deg) | Min depth (m) | Max depth (m) | Mean depth (m) |
|---|---|---|---|---|---|---|---|
| Blender 1 | 15 | 20 | 0.116 | 0.20 | 0.316 | 11.234 | 3.73 |
| Blender 2 | 10 | 20 | 0.010 | 0.29 | 1.92 | 19.453 | 6.21 |
Table 4: Main characteristics of the two proposed datasets.

Experiments.

In our experiments, whose results are reported in Table 1 and Table 2, we evaluated the Saop method [11] by averaging its results over all scenes on which it successfully converged. On the Blender 1 dataset, we excluded one scene on which Saop did not converge. Excluding this scene for Saop changes neither the ranking of the methods nor the conclusions of our experiments.

| Method | Abs rel ↓ | Sqr rel ↓ | RMSE ↓ | Delta 1 ↑ | Delta 2 ↑ | Delta 3 ↑ |
|---|---|---|---|---|---|---|
| Blender 1 (small motion) | | | | | | |
| Midas [36] | 0.1589 | 1.0747 | 1.3148 | 0.8019 | 0.951 | 0.9824 |
| RCVD [27] | 0.2038 | 1.0622 | 1.3888 | 0.698 | 0.9191 | 0.9684 |
| Ours | 0.1544 | 0.2229 | 0.9258 | 0.7881 | 0.9544 | 0.9911 |
| Blender 2 (micro motion) | | | | | | |
| Midas [36] | 0.0790 | 0.0786 | 0.7166 | 0.9429 | 0.9929 | 0.9986 |
| RCVD [27] | 0.0971 | 0.1131 | 0.8244 | 0.9149 | 0.988 | 0.9973 |
| Ours | 0.1763 | 0.2875 | 1.3711 | 0.6857 | 0.9594 | 0.9976 |

Table 5: Depth error metrics on the two proposed synthetic burst datasets.

E Ablation study

We conduct an ablation study to understand the impact of the different choices in our modeling and algorithm. We compare the full algorithm to variants that use the same hyperparameters but differ in one respect each: without the exponential parametrization of the motion, without the Newton step, with spatial regularization (total variation and determinant), without the plane parametrization, with patches of size one (i.e., a pixel-wise loss), and without the multiscale approach. We report the performance on the flow estimates in Table 6, and on depth and pose in Table 7.

| Method | Blender 1: EPE ↓ | Blender 1: RMSE ↓ | Blender 1: NPE1 ↑ | Blender 1: NPE2 ↑ | Blender 1: NPE3 ↑ | Blender 2: EPE ↓ | Blender 2: RMSE ↓ | Blender 2: NPE1 ↑ | Blender 2: NPE2 ↑ | Blender 2: NPE3 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Base | 0.7439 | 1.4324 | 0.7841 | 0.9084 | 0.9456 | 0.2321 | 0.2820 | 0.9366 | 0.9972 | 1.0000 |
| with regularization | 0.7641 | 1.4596 | 0.7732 | 0.9024 | 0.9432 | 0.2660 | 0.3286 | 0.9297 | 0.9937 | 0.9997 |
| with $k=1$ (pixelwise) | 0.8102 | 1.4705 | 0.7512 | 0.8909 | 0.9377 | 0.2834 | 0.3482 | 0.9220 | 0.9940 | 0.9997 |
| w/o plane parametrization | 544.3828 | 3151.1105 | 0.2983 | 0.4619 | 0.5569 | 219.6589 | 1234.7524 | 0.6671 | 0.7301 | 0.7420 |
| w/o exponential parametrization | 0.7685 | 1.4629 | 0.7721 | 0.9013 | 0.9421 | 0.2658 | 0.3294 | 0.9294 | 0.9937 | 0.9997 |
| w/o Newton step | 0.7676 | 1.4630 | 0.7725 | 0.9015 | 0.9422 | 0.2741 | 0.3402 | 0.9272 | 0.9933 | 0.9997 |

Table 6: Optical flow errors on Blender 1 (small motion) and Blender 2 (micro motion).
| Method | Pose: Left l2 (m) ↓ | Pose: ATE (m) ↓ | Pose: Geom (m) ↓ | Pose: Biinvrot l2 (deg) ↓ | Depth: Abs rel ↓ | Depth: Sqr rel ↓ | Depth: RMSE ↓ | Depth: Delta 1 ↑ | Depth: Delta 2 ↑ | Depth: Delta 3 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Dataset Blender 1 (small motion) | | | | | | | | | | |
| Base | 0.0066 | 0.0056 | 0.0050 | 0.1806 | 0.1381 | 0.2391 | 0.8688 | 0.8358 | 0.9263 | 0.9761 |
| with regularization | 0.0072 | 0.0062 | 0.0053 | 0.1850 | 0.1399 | 0.2462 | 0.8777 | 0.8344 | 0.9236 | 0.9759 |
| with $k=1$ (pixelwise) | 0.0073 | 0.0062 | 0.0054 | 0.1883 | 0.1538 | 0.2673 | 0.9217 | 0.8087 | 0.9220 | 0.9734 |
| w/o plane parametrization | 0.0267 | 0.0250 | 0.0221 | 0.4317 | 0.4518 | 0.7926 | 1.7602 | 0.2319 | 0.5076 | 0.7142 |
| w/o exponential parametrization | 0.0073 | 0.0062 | 0.0054 | 0.1865 | 0.1392 | 0.2398 | 0.8694 | 0.8340 | 0.9252 | 0.9763 |
| w/o Newton step | 0.0073 | 0.0062 | 0.0054 | 0.1861 | 0.1393 | 0.2397 | 0.8696 | 0.8342 | 0.9253 | 0.9763 |
| Dataset Blender 2 (micro motion) | | | | | | | | | | |
| Base | 0.0022 | 0.0022 | 0.0020 | 0.0245 | 0.1383 | 0.1962 | 1.1521 | 0.7996 | 0.9819 | 0.9983 |
| with regularization | 0.0023 | 0.0022 | 0.0020 | 0.0256 | 0.1766 | 0.2935 | 1.3640 | 0.6943 | 0.9498 | 0.9948 |
| with $k=1$ (pixelwise) | 0.0024 | 0.0024 | 0.0022 | 0.0287 | 0.1750 | 0.2928 | 1.3629 | 0.6932 | 0.9503 | 0.9952 |
| w/o plane parametrization | 0.0040 | 0.0039 | 0.0037 | 0.0430 | 0.2571 | 0.5270 | 1.7727 | 0.5005 | 0.8881 | 0.9857 |
| w/o exponential parametrization | 0.0022 | 0.0022 | 0.0020 | 0.0261 | 0.1755 | 0.2908 | 1.3590 | 0.6981 | 0.9510 | 0.9949 |
| w/o Newton step | 0.0023 | 0.0022 | 0.0021 | 0.0258 | 0.1818 | 0.3040 | 1.3837 | 0.6807 | 0.9458 | 0.9946 |

Table 7: Pose and depth error metrics on the two proposed synthetic burst datasets.

F Comparison with monocular method

Monocular depth estimation methods can only estimate depth up to an affine transformation, so we evaluate them up to an affine correction. Comparing them directly with the binocular methods evaluated with a linear correction in Table 2 would therefore not be meaningful. To make the comparison possible, we recompute the errors of the methods in Table 2 with an affine correction. The results are reported in Table 5.
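
To make the difference between the two corrections concrete, here is a minimal sketch, assuming both are fitted by ordinary least squares against the ground truth: the linear correction fits a scale only, while the affine correction fits a scale and a shift. The function names and the use of `abs_rel` as the example metric are illustrative; the exact fitting procedure used for the tables (for example, whether it is applied to depth or to disparity) is not specified here.

```python
import numpy as np

def linear_correction(pred, gt):
    """Scale-only least squares: s minimizing ||s * pred - gt||^2."""
    s = np.sum(pred * gt) / np.sum(pred * pred)
    return s * pred

def affine_correction(pred, gt):
    """Scale-and-shift least squares: (s, t) minimizing ||s * pred + t - gt||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t

def abs_rel(pred, gt):
    """Mean absolute relative error, assuming strictly positive ground truth."""
    return np.mean(np.abs(pred - gt) / gt)

# A prediction that is correct up to an affine map is fully repaired by the
# affine correction but only partially by the linear one.
pred = np.linspace(1.0, 5.0, 50)
gt = 2.0 * pred + 0.5
err_linear = abs_rel(linear_correction(pred, gt), gt)
err_affine = abs_rel(affine_correction(pred, gt), gt)
```

Because the affine fit has one more degree of freedom, its residual is never larger than that of the linear fit, which is why errors computed under the two corrections are not directly comparable.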

G Estimated occlusion mask

We run the fixed point algorithm described in Appendix B on the depth map obtained at the last optimization step and record the points for which it does not converge. The non-convergent set is determined by a threshold and a maximum number of iterations. This set constitutes a partial occlusion mask. It can be used in downstream tasks to avoid aggregating erroneous information from occluded regions. Fig. 10 shows examples of masks on synthetic data.

Figure 10: Partial occlusion mask obtained using the fixed point algorithm for four examples of the Blender 2 dataset.
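
The construction of the non-convergent set can be sketched generically as follows. This is only an illustration of thresholding a fixed point iteration after a maximum number of steps: the callable `step` is a hypothetical stand-in for the update of Appendix B, which is not reproduced here, and the default values of `tol` and `max_iter` are arbitrary.

```python
import numpy as np

def nonconvergent_mask(step, x0, tol=1e-3, max_iter=50):
    """Iterate x <- step(x) and flag the points that have not converged.

    step:  hypothetical per-pixel update mapping an (H, W, 2) array to an
           (H, W, 2) array (stand-in for the Appendix B iteration).
    x0:    initial (H, W, 2) array.
    Returns the final iterate and a boolean (H, W) mask that is True where
    the last update still moves the point by at least `tol`.
    """
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        x = step(x)
    residual = np.linalg.norm(step(x) - x, axis=-1)
    return x, residual >= tol

# Toy example: pixels with coefficient 0.5 contract to a fixed point, pixels
# with coefficient -1 oscillate forever and end up in the mask.
c = np.array([[0.5, -1.0], [-1.0, 0.5]])
toy_step = lambda x: c[..., None] * x + 1.0
_, mask = nonconvergent_mask(toy_step, np.zeros((2, 2, 2)))
```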

H Depth maps

We provide additional examples of depth maps from both synthetic bursts (Fig. 11) and real bursts (Fig. 12). All disparity maps were aligned to the ground truth with an affine transform using the least-squares criterion of [36].

For a fair comparison, we also show the results of DfUSMC without its additional depth map filtering. This filtering step is essential to obtain a visually appealing depth map with DfUSMC, but it introduces a stratification of the depth map that is not present in our method.

[Image grid, four scenes. Columns, left to right: Ref. image, Groundtruth, Midas [36], RCVD [27], Saop [11], DfUSMC [21], DfUSMC no filt., Ours, Ours + reg. Column groups: learning based (Midas, RCVD) and optimization based (Saop, DfUSMC, DfUSMC no filt., Ours, Ours + reg.).]
Figure 11: Depth estimation from synthetic bursts (Blender 2 dataset).
[Image grid, two scenes. Columns, left to right: Ref. image, Midas [36], RCVD [27], DfUSMC [21], Ours.]
Figure 12: Depth estimation from real bursts.

I Pose estimation visualization

To visualize the camera poses estimated by the algorithm, we look at their translation part. Because our images come from a burst, we can exploit the temporal coherence of the sequence and trace the trajectory of the camera center during the burst. After rescaling, we compare the trajectory estimated by the algorithm to the trajectory used to create the burst in Blender. Fig. 13 shows examples of trajectories for different images of the Blender 2 dataset during the last three stages.

[Image grid. Columns: Stage 0, Stage 1, Stage 2. Rows: the scenes of Fig. 10 (rows 1 to 4) and of Fig. 11 (rows 2 and 3).]
Figure 13: Trajectory at different scales of the coarse to fine approach for all the scenes shown in Fig. 10 and Fig. 11.
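
The rescaling step is needed because the estimated trajectory is only defined up to a global scale. One possible way to fix this scale, shown below as a sketch, is a least-squares fit of a single scalar between the estimated and ground-truth camera centers. The function name and the choice of a scale-only fit, with both trajectories expressed relative to the reference camera, are assumptions made for illustration.

```python
import numpy as np

def rescale_trajectory(est, gt):
    """Scale `est` by the scalar s minimizing ||s * est - gt||^2.

    est, gt: (N, 3) arrays of camera centers, both assumed to be expressed
             relative to the reference camera (which sits at the origin).
    Returns the rescaled trajectory and the scale s.
    """
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    s = np.sum(est * gt) / np.sum(est * est)
    return s * est, s
```

With the scale fixed, the two trajectories can be overlaid in the same plot or compared with a trajectory error.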

J Visual inspection of the registration of real frames

Fig. 14 illustrates the alignment quality achieved with our method on a real burst. To assess it, we generate images by overlaying the green and blue channels of the warped source images onto the red channel of the target image, following an approach similar to [12]. In this example, most frames are well aligned, while a few (5 out of 15) show inadequate alignment, particularly in certain regions of the foreground (see, for example, the books or the plant).

Figure 14: Qualitative alignment results of our method on a real burst. Images are generated by superimposing the warped source images on the target image.
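
The overlay used for this visual check can be sketched in a few lines. The function name `overlay` and the `(H, W, 3)` RGB layout are assumptions for illustration.

```python
import numpy as np

def overlay(target, warped_source):
    """Compose an image from the red channel of the target and the green and
    blue channels of the warped source. Both inputs are (H, W, 3) RGB arrays."""
    target = np.asarray(target)
    warped_source = np.asarray(warped_source)
    if target.shape != warped_source.shape or target.shape[-1] != 3:
        raise ValueError("expected two RGB images of identical shape")
    return np.stack(
        [target[..., 0], warped_source[..., 1], warped_source[..., 2]], axis=-1
    )
```

When the alignment is accurate, the composite looks like an ordinary color image; misaligned regions show up as red or cyan fringes.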

K Super-resolution on real bursts

To showcase the ability of our method to produce fine alignments on real images, we perform burst super-resolution (SR) with our alignments. To achieve the task, we use the popular inverse problem framework employed in [18, 28]. To recover the high-resolution image $\mathbf{x}$ from a set of $K$ noisy and low-resolution observations $\mathbf{y}_i$ with $i \in [0, K]$, we solve the minimization problem $\min_{\mathbf{x}} \sum_i^K \|DBW_i\mathbf{x} - \mathbf{y}_i\|_2^2$ with a gradient descent algorithm.

$D$ is a decimation operator that reduces spatial resolution, $B$ is a blurring operator, and $W_i$ is a warp parametrized by the optical flow. In our experiments, $DB$ is chosen as the average pooling operator following [28]. Up to a constant factor of 2, the gradient is $\sum_i^K W_i^\top B^\top D^\top (DBW_i\mathbf{x} - \mathbf{y}_i)$.
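
The following toy sketch runs gradient descent on this objective. It takes $DB$ as average pooling, as stated above, but replaces the flow-based warp $W_i$ with a circular integer translation so that the adjoint $W_i^\top$ is trivial to write. The translation warp, the function names, and the default step size are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def avg_pool(x, s):
    """DB: average pooling by a factor s (image sides must be divisible by s)."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def avg_pool_T(r, s):
    """Adjoint of avg_pool: spread each value over an s x s block, times 1/s^2."""
    return np.kron(r, np.ones((s, s))) / (s * s)

def warp(x, shift):
    """W_i: toy warp, a circular integer translation (stand-in for a flow warp)."""
    return np.roll(x, shift, axis=(0, 1))

def warp_T(x, shift):
    """Adjoint of the circular translation: translate back."""
    return np.roll(x, (-shift[0], -shift[1]), axis=(0, 1))

def loss(x, ys, shifts, s):
    """Data term: sum_i ||D B W_i x - y_i||^2."""
    return sum(np.sum((avg_pool(warp(x, d), s) - y) ** 2)
               for y, d in zip(ys, shifts))

def super_resolve(ys, shifts, s, n_iter=200, step=None):
    """Gradient descent on the data term, starting from a zero image."""
    if step is None:
        # Conservative step: the operator sum_i (DBW_i)^T (DBW_i) has norm at
        # most K / s^2; the factor 2 of the gradient is absorbed in the step.
        step = s * s / len(ys)
    h, w = ys[0].shape
    x = np.zeros((h * s, w * s))
    for _ in range(n_iter):
        grad = np.zeros_like(x)
        for y, d in zip(ys, shifts):
            residual = avg_pool(warp(x, d), s) - y
            grad += warp_T(avg_pool_T(residual, s), d)
        x -= step * grad
    return x
```

With sub-pixel motion between frames (here, the four integer shifts of a 2x2 block at the high-resolution scale), the low-resolution frames sample complementary information, which is what allows the data term to constrain the high-resolution image.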

The optical flow used to warp the candidate high-resolution reference image $\mathbf{x}$ is estimated in two steps: we first run our method, then apply the fixed point algorithm presented in Sec. 3 to infer the motion field of interest. We perform super-resolution on linear-space RGB images obtained by demosaicking the RAW frames with bilinear filtering. Joint super-resolution and demosaicking is left for future work.

We visually compare our results in Figure 15. Our algorithm can recover fine details, including, for instance, the fine texture on the rum bottle or the hair of the doll, that were not distinguishable in the original frames.

Figure 15: Burst super-resolution on real raw bursts exploiting our alignment method. Top: low-resolution crops. Bottom: super-resolution results obtained with our alignment method. Data kindly provided by the authors of [28]. Best viewed zoomed in on a computer screen.