Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping

Tianli Liao, Lei Li and Guangen Liu
College of Information Science and Engineering, Henan University of Technology
{tianli.liao, leili, lgendd_99}@haut.edu.cn
   Ce Wang
Hong Kong University of Science and Technology
[email protected]
   Nan Li
School of Mathematical Sciences, Shenzhen University
[email protected]
Abstract

Large parallax between images is an intractable issue in image stitching. Various warping-based methods have been proposed to address it, yet the results remain unsatisfactory. In this paper, we propose a novel image stitching method using multi-homography warping guided by image segmentation. Specifically, we leverage the Segment Anything Model to segment the target image into numerous contents and partition the feature points into multiple subsets via an energy-based multi-homography fitting algorithm. The multiple subsets of feature points are used to calculate the corresponding multiple homographies. For each segmented content in the overlapping region, we select its best-fitting homography with the lowest photometric error. For each segmented content in the non-overlapping region, we calculate a weighted combination of the linearized homographies. Finally, the target image is warped via the best-fitting homographies to align with the reference image, and the final panorama is generated via linear blending. Comprehensive experimental results on public datasets demonstrate that our method provides the best alignment accuracy by a large margin, compared with the state-of-the-art methods. The source code is available at https://github.com/tlliao/multi-homo-warp.

1 Introduction

Figure 1: Stitching results comparison with various warping methods, panels (a)-(k). All results are generated via simple linear blending. Obvious parallax can be seen from the tower and the red statue. (e)-(g) correspond to spatially-varying warps, (h)-(j) correspond to mesh-based warps, and (k) is the learning-based warp. Compared with other warps, our method achieves the best alignment quality.

Image stitching is a widely used technology in various computer vision applications and consumer photography. Given captured images with a narrow field-of-view (FoV), the aim is to construct a panorama of the scene with a wider FoV. Although several academic and commercial stitching tools have been published, e.g., Adobe Photoshop Photomerge, Microsoft Image Composite Editor, and Autostitch [2], image stitching with large parallax is still an intractable problem. Various stitching methods have been proposed to handle it using different strategies.

Most of the existing methods follow a similar pipeline with two steps [31]: image warping and alignment, followed by image composition and blending. Generally, the crucial image warping step determines the stitching quality, and the homography is a simple and commonly used warping model representing the planar transformation between the input scenes. However, it assumes that the captured scene is planar or that the camera motion is purely rotational. Violating these assumptions leads to parallax between images, which a single homography cannot handle. As the case of Fig. 1(d) shows, severe ghosting artifacts are generated in the stitching results.

To alleviate the artifacts caused by parallax, adaptive warping models are proposed using either spatially-varying warps [11, 4, 17, 21, 20, 23, 26, 34, 39, 25] or mesh-based warps [3, 6, 36, 33, 22, 15, 18, 38, 5, 9]. In particular, the spatially-varying warps aim to create a spatially varying motion field in the image domain. These warps provide better alignment in the overlapping region and can be smoothly extrapolated to the non-overlapping region. Mesh-based warps partition the images into regular meshes and formulate the warping model as the energy minimization of the mesh deformation. Different energy functions are then designed to maintain the smoothness of the warps, improving the alignment accuracy and preventing distortions. Designing a deep learning framework for image stitching [30] is another way to handle parallax, but it is still subject to a smoothness constraint. For images with larger parallax, warps with smoothness constraints easily fail to align the foreground objects and background scenes simultaneously, as shown in Fig. 1(e-k). Intuitively, aligning the objects and scenes separately by different homographies and fusing them will improve the alignment. Motivated by the remarkable segmentation performance of the Segment Anything Model (SAM) [16], as shown in Fig. 1(b), we propose the multi-homography warping method, which achieves the best alignment accuracy, as shown in Fig. 1(c).

In this paper, we propose a stitching method via multi-homography warping for images with large parallax. First, we segment the target image into numerous contents via SAM and detect feature points from the input images. We then partition the feature points into multiple subsets via the energy-based multi-homography fitting algorithm to generate multiple homographies. Next, for each segmented content in the overlapping region, we select (label) its best-fitting homography with the lowest photometric error. For each content in the non-overlapping region, we calculate (label) a weighted combination of the linearized homographies. Finally, the target image is warped based on the homographies and aligned with the reference image. Experimental results show that our method can accurately align different image contents with large parallax and outperforms the state-of-the-art warping methods by a large margin. The main contributions of our work are three-fold:

  • We propose a stitching method that integrates SAM into the warping model generation and achieves the best alignment accuracy.

  • The multi-homography fitting algorithm we designed can provide better homography models than the iterative RANSAC, especially for images with large parallax. Our algorithm is more stable and robust to the outliers of the feature matches.

  • We propose a parameter-free forward and backward warping strategy to warp different contents in the target image via multiple homographies and generate well-aligned results with correct occlusions (holes).

The rest of this paper is organized as follows. Section 2 introduces the related work of image stitching. Section 3 gives a detailed description of our multi-homography warping method, including multi-homography fitting, multi-labeling, image warping and blending. Section 4 demonstrates the experimental results. Section 5 concludes the paper.

2 Related Work

Figure 2: Pipeline of our multi-homography warping method.

2.1 Spatially-varying Warps

Given a set of feature matches, some warping methods adopt piece-wise homographies as adaptive warping models, where every local homography is determined via a weighting scheme. Gao et al. [11] proposed a dual-homography (DH) warping model, where two representative homographies for two dominant planes (distant plane and ground plane) are estimated. Then the spatially-varying motion field is calculated by interpolating the two homographies as basis motions. Zheng et al. [39] further developed DH to handle the overlapping scene consisting of multiple dominant planes.

Zaragoza et al. [34] proposed an as-projective-as-possible (APAP) warping model to handle parallax. It generates a smooth spatially-varying motion field by fitting weighted correspondences in feature matches instead of interpolating basis motions. For each quad mesh, it estimates a location-dependent homography by weighted direct linear transform. Li et al. [20] and Lee et al. [17] improved the APAP model by estimating a weighted homography for each triangle and superpixel, respectively. Li et al. [19] proposed a robust elastic warping model to handle images with parallax. They formulate the image warping as a thin plate spline model with a simple radial basis function.

Other spatially-varying warps [4, 23, 21, 26] focus on alleviating the distortions in the non-overlapping regions between input images, and they are often combined with APAP or other warping models to handle the parallax issues. The above warping methods can generate much better alignment results than a single homography. However, they often fail to align objects with large parallax and abrupt depth changes due to the smoothness constraint.

Lee et al. [17] introduced the warping residual and estimated a discontinuous warp with “holes” to handle such issues. Lin et al. [25] proposed a method using a disparity map and multiple homographies to distinguish one background plane and multiple foreground objects and align them separately. Both methods adopt iterative RANSAC to generate multiple homographies. However, RANSAC can only provide a sub-optimal model to fit the inliers, the effect of the outliers is hard to eliminate, and the random process may result in unstable homography models. Besides, the performance of Lee’s method [17] depends on the accuracy of the warping residual and the superpixel segmentation. The “holes” are produced via a hand-crafted selection strategy, which depends on the parameter setting. Lin’s method [25] cannot handle the case of feature match deficiency in the foreground objects. With the remarkable performance of SAM, our method can provide a more robust and accurate discontinuous warp.

2.2 Mesh-based Warps

Another strategy to handle parallax is to represent the image as quad meshes and model the image warping as mesh deformation. Chen et al. [6] proposed a mesh-based method with a global similarity prior to address the distortions of unnatural rotation and scaling. They calculate the alignment term based on APAP-generated vertex correspondences instead of feature matches, since the former are much more uniformly distributed. Li et al. [18] and Zhang et al. [38] further developed the global similarity prior to handle a single image with multiple dominant regions.

Zhang et al. [36] proposed to enforce a straightness constraint in the mesh deformation. Liao and Li [22] proposed a single-perspective warping to mitigate the perspective distortions in the non-overlapping region. Jia et al. [15] further leveraged the line-point consistency to provide more accurate alignment and preserve both local and global lines. Chen et al. [5] proposed an angle-consistent warping that integrates angle features of key points into homography estimation and mesh deformation. Du et al. [9] proposed a structure-preserving warping model to preserve the large-scale structures reflected by straight lines or curves.

The mesh-based warps focus on exploring different geometrical or content constraints to smoothly extrapolate the warping from the overlapping region to the non-overlapping region. For inputs with large parallax, they easily fail.

2.3 Other Warps

To handle large parallax, a seam-driven strategy has been introduced, which combines warping models with the subsequent composition step to generate visually pleasing results instead of geometrically accurate alignment. Gao et al. [12] proposed to generate multiple homography candidates and choose the homography that yields the best seam-cutting result. Zhang et al. [35] and Lin et al. [24] further developed the strategy by introducing mesh-based warping candidates, which provide better alignment. Because the strategy targets visually pleasing results rather than the most geometrically accurate alignment, the warping candidates still need to align certain local regions well so that a plausible seam can be found.

Some learning-based methods [28, 29] applied deep homography estimation [8] to stitch images. Nie et al. [30] further introduced an unsupervised stitching method which proposed a flexible warp to model the image alignment from global homography to local thin-plate spline motion. However, the flexible warp with smoothness constraint still fails to handle images with large parallax.

3 Method

In this section, we present our image stitching method via multi-homography warping, which mainly includes three steps: multi-homography fitting, multi-labeling for image contents, and image warping and blending. The pipeline of our method is shown in Fig. 2.

3.1 Multi-homography Fitting

Given the target image and the reference image, we first segment the target image into numerous contents $\{C_{k}\}_{k=1}^{M}$ via SAM [16]. Then we detect the feature matches and remove the outliers via RANSAC [10]. Note that the feature matches cannot be fitted with a single homography due to the parallax. We therefore employ the fundamental matrix to select the inliers, based on the observation that a correct feature match $(\mathbf{p},\mathbf{q})$ should satisfy:

$$\|\tilde{\mathbf{p}}^{T}F_{*}\tilde{\mathbf{q}}\|<\epsilon,$$

where $\tilde{\mathbf{p}}$ ($\tilde{\mathbf{q}}$) is the homogeneous coordinate of the feature point $\mathbf{p}$ ($\mathbf{q}$), and $F_{*}$ is obtained by solving the following problem using the direct linear transform (DLT) method:

$$F_{*}=\arg\min_{F}\sum_{(\mathbf{p},\mathbf{q})}\|\tilde{\mathbf{p}}^{T}F\tilde{\mathbf{q}}\|. \tag{1}$$
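The inlier test above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes $F_{*}$ has already been estimated (the DLT step of Eq. 1 is not shown), and the names `epipolar_residual`, `select_inliers` and the toy matrix `F_toy` are hypothetical. Note that the quantity is an algebraic residual, so the threshold is not measured in pixels.

```python
def epipolar_residual(p, q, F):
    """Return |p~^T F q~| for pixel coordinates p, q and a 3x3 matrix F."""
    p_h = (p[0], p[1], 1.0)  # homogeneous coordinate of p
    q_h = (q[0], q[1], 1.0)  # homogeneous coordinate of q
    Fq = [sum(F[r][c] * q_h[c] for c in range(3)) for r in range(3)]
    return abs(sum(p_h[r] * Fq[r] for r in range(3)))


def select_inliers(matches, F, eps):
    """Keep the matches whose algebraic epipolar residual is below eps."""
    return [(p, q) for p, q in matches if epipolar_residual(p, q, F) < eps]


# Toy matrix (assumption for illustration): with this F the residual
# reduces to |q_y - p_y|, i.e. matches must lie on the same image row.
F_toy = [[0.0, 0.0, 0.0],
         [0.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]
matches = [((10.0, 5.0), (14.0, 5.0)),   # same row: inlier
           ((3.0, 8.0), (9.0, 8.2)),     # 0.2 off: inlier for eps = 0.5
           ((7.0, 2.0), (7.0, 6.0))]     # 4 rows off: rejected
inliers = select_inliers(matches, F_toy, eps=0.5)
```

With the toy data, the first two matches pass the test and the third is rejected.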

The operation results in a set of feature matches $\{(\mathbf{p}_{i},\mathbf{q}_{i})\}_{i=1}^{N}$ between the two input images. Then we employ the multi-model fitting method [14] to partition the set of feature matches into several subsets, each of which fits a single homography. In particular, we aim to minimize the following energy function:

$$E(\mathcal{H})=\sum_{i=1}^{N}\mathrm{D}(\mathbf{p}_{i},H_{\mathbf{p}_{i}})+\lambda\sum_{(\mathbf{p}_{i},\mathbf{p}_{j})\in\mathcal{N}}\delta(H_{\mathbf{p}_{i}}\neq H_{\mathbf{p}_{j}})+\beta|\mathcal{H}|, \tag{2}$$

where $\mathcal{H}=\{H_{\mathbf{p}_{i}}\ |\ \mathbf{p}_{i}\in P\}$ is the assignment of homography models to feature points $P=\{\mathbf{p}_{i}\}_{i=1}^{N}$. We next explain each of the involved terms.

The first term $\mathrm{D}(\cdot,\cdot)$ measures the error of the feature point $\mathbf{p}_{i}$ fitted by the given homography $H_{\mathbf{p}_{i}}$, which is defined as the symmetric transfer error (STE):

$$\mathrm{D}(\mathbf{p}_{i},H_{\mathbf{p}_{i}})=\|H_{\mathbf{p}_{i}}\mathbf{p}_{i}-\mathbf{q}_{i}\|+\|H^{-1}_{\mathbf{p}_{i}}\mathbf{q}_{i}-\mathbf{p}_{i}\|. \tag{3}$$

The homography model $H_{\mathbf{p}_{i}}$ is computed by minimizing the non-linear STE of all the feature points fitted by $H_{\mathbf{p}_{i}}$ using the Levenberg-Marquardt method. The initial solution for the non-linear minimization is found using the DLT method.
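A minimal sketch of the symmetric transfer error of Eq. 3 is given below, assuming the homography is a 3x3 nested list and points are pixel coordinates. The helper names (`apply_h`, `invert_3x3`, `symmetric_transfer_error`) are hypothetical, and the sketch follows Eq. 3 as written, using unsquared distances. The Levenberg-Marquardt refinement is not shown.

```python
import math


def apply_h(H, pt):
    """Map a pixel (x, y) through the 3x3 homography H (with dehomogenization)."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def invert_3x3(M):
    """Invert a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def symmetric_transfer_error(p, q, H):
    """||H p - q|| + ||H^{-1} q - p||, following Eq. 3."""
    fx, fy = apply_h(H, p)
    bx, by = apply_h(invert_3x3(H), q)
    return math.hypot(fx - q[0], fy - q[1]) + math.hypot(bx - p[0], by - p[1])


# Toy homography (assumption): a pure translation by (2, 3).
H_shift = [[1.0, 0.0, 2.0],
           [0.0, 1.0, 3.0],
           [0.0, 0.0, 1.0]]
err_exact = symmetric_transfer_error((1.0, 1.0), (3.0, 4.0), H_shift)
err_off = symmetric_transfer_error((1.0, 1.0), (3.0, 5.0), H_shift)
```

A match that the homography explains exactly has zero error, while a match displaced by one pixel contributes one pixel in each direction.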

It is worth noting that the fundamental matrix-based RANSAC cannot remove all the outliers in the presence of image noise. To properly separate the outliers from the inliers in the multi-homography fitting step, we define an extra “outlier” model $\emptyset$, which has a constant fidelity measure $\mathrm{D}(\mathbf{p}_{i},\emptyset)=\gamma$ for all points $\mathbf{p}_{i}\in P$.

The second term measures the smoothness between the neighboring feature points fitted by different homography models. Since the segmented contents of SAM provide abundant semantic information, we use them to construct a better neighborhood system of the feature points in the target image. Notice that any two feature points in the same content are assumed to be parallax-free and can be fitted via the same homography, so the smoothness term in Eq. 2 should be zero for them. To be specific, we define the neighborhood $\mathcal{N}$ as:

$$\mathcal{N}=\{(\mathbf{p}_{i},\mathbf{p}_{j})\ |\ \mathbf{p}_{i}\in C_{k}^{o},\ \mathbf{p}_{j}\in C_{k}^{o},\ \overline{\mathbf{p}_{i}\mathbf{p}_{j}}\in\triangle_{m}\}_{k,m} \tag{4}$$

where $C_{k}^{o}$ denotes the segmented content in the overlapping region $O$, and $\triangle_{m}$ is a triangle generated by the Delaunay triangulation of the feature points. The indicator function $\delta(\cdot)$ in Eq. 2 is 1 if the condition inside the parentheses is true and 0 otherwise.
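The neighborhood of Eq. 4 can be sketched as a filter over triangulation edges. The sketch assumes the Delaunay triangulation has been computed elsewhere and is passed in as index triples (for instance, the `simplices` attribute of `scipy.spatial.Delaunay` has this form), and that each feature point already carries the id of the SAM content it falls in. The names `build_neighborhood` and `content_of` are hypothetical.

```python
def build_neighborhood(triangles, content_of):
    """Return the edge set N of Eq. 4.

    triangles  : iterable of (i, j, k) point-index triples of a triangulation
    content_of : content_of[i] is the content id of point i, or None if the
                 point lies outside every content of the overlapping region
    """
    edges = set()
    for i, j, k in triangles:
        for a, b in ((i, j), (j, k), (i, k)):
            ca, cb = content_of[a], content_of[b]
            # Keep an edge only if both endpoints lie in the same content.
            if ca is not None and ca == cb:
                edges.add((min(a, b), max(a, b)))
    return sorted(edges)


# Toy data (assumption): four points, two triangles, two contents.
triangles = [(0, 1, 2), (1, 2, 3)]
content_of = [0, 0, 1, 1]
neighbors = build_neighborhood(triangles, content_of)
```

In the toy data, only the edges (0, 1) and (2, 3) survive, because every other triangle edge crosses a content boundary.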

The third term controls the number of homography models: each additional model in $\mathcal{H}$ adds a cost of $\beta$.
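To make the three terms concrete, the following sketch evaluates the energy of Eq. 2 for a given labeling. It only scores a labeling and does not perform the optimization. Two choices are assumptions of this sketch rather than statements from the text: the outlier model is not counted in $|\mathcal{H}|$, and a neighboring pair with one outlier endpoint is penalized like any other label mismatch. The names `energy`, `data_cost` and `OUTLIER` are hypothetical.

```python
OUTLIER = None  # label of the extra "outlier" model


def energy(labels, data_cost, neighbors, lam, beta, gamma):
    """Evaluate Eq. 2.

    labels[i]       : index of the homography assigned to point i, or OUTLIER
    data_cost[i][m] : D(p_i, H_m), e.g. the symmetric transfer error
    neighbors       : list of index pairs (i, j) forming the neighborhood N
    """
    # Data term: fitting error, or the constant gamma for outliers.
    data = sum(gamma if lab is OUTLIER else data_cost[i][lab]
               for i, lab in enumerate(labels))
    # Smoothness term: lambda for every neighboring pair with different labels.
    smooth = lam * sum(1 for i, j in neighbors if labels[i] != labels[j])
    # Label cost: beta per homography actually used (outlier model excluded).
    n_models = len({lab for lab in labels if lab is not OUTLIER})
    return data + smooth + beta * n_models


# Toy data (assumption): four points, two candidate homographies.
data_cost = [[0.5, 4.0], [0.4, 5.0], [6.0, 0.3], [9.0, 8.0]]
neighbors = [(0, 1), (1, 2)]
e_two_models = energy([0, 0, 1, OUTLIER], data_cost, neighbors, 1.0, 2.0, 3.0)
e_one_model = energy([0, 0, 0, 0], data_cost, neighbors, 1.0, 2.0, 3.0)
```

In the toy data, using two models plus one outlier gives a lower energy than forcing all points onto a single homography, even though it pays a higher label cost.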

To optimize the energy function in Eq. 2 efficiently, an extension of the $\alpha$-expansion algorithm [7] is used. For an accurate starting state, we first initialize the homography models $\mathcal{H}_{0}$ by iteratively performing the homography-based RANSAC until the number of remaining feature matches is smaller than a predefined threshold (50 in our experiment).

Figure 3: Partition results of the feature points by our multi-homography fitting algorithm without and with SAM. (a) Neighborhood system defined only by Delaunay triangulation and the partition result. (b) Neighborhood system defined by Delaunay triangulation & SAM, and the partition result. Introducing SAM to define the neighborhood system boosts the model performance.
Figure 4: Labeling results on different image pairs. Top: SAM results. Middle: labeling results for overlapping regions. Each color corresponds to a label and fits a homography. Bottom: average photometric errors for each labeling content, where errors are shown as a heat map.

For ease of understanding, we visualize one multi-homography fitting result on the test case [35] in Fig. 2. The feature matches are partitioned into ten subsets, each fitting a homography. Among them, the first homography successfully aligns the background buildings. For the foreground scene, the second and the sixth fitted homographies align the tower and the red statue, respectively. To verify the effectiveness of SAM-based contents, we compare the fitting results calculated by defining the neighborhood system without and with SAM-based contents. As shown in Fig. 3, the neighborhood system involving SAM produces a more accurate partition of the feature points. Sec. 4.3 provides a detailed discussion of its effectiveness.

3.2 Multi-labeling for Image Contents

Then, we use the obtained multiple homographies to label (align) the segmented contents of the target image.

Labeling for the overlapping region. We first calculate a global homography warp $H_{g}$ using all the feature matches to construct the overlapping region. Then, for each content $C^{o}_{k}$ in the overlapping region, we label it with the homography that aligns it to the reference image with the lowest photometric error. The photometric error of the content $C^{o}_{k}$ aligned by $H_{i}$ is calculated as:

$$e(C_{k}^{o},H_{i})=\frac{1}{|C^{o}_{k}|}\sum_{\mathbf{p}\in C^{o}_{k}}\|I_{r}(H_{i}(\mathbf{p}))-I_{t}(\mathbf{p})\|,\ H_{i}\in\mathcal{H}, \tag{5}$$

where $H_{i}$ maps the pixel coordinate $\mathbf{p}$ in the target image to the coordinate $H_{i}(\mathbf{p})$ in the reference image.
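The labeling rule can be sketched as an argmin over Eq. 5. This is a simplified illustration under stated assumptions: images are grayscale nested lists indexed as `img[y][x]`, the reference image is sampled with nearest-neighbor lookup, and pixels that map outside the reference image are skipped and excluded from the average (Eq. 5 itself divides by $|C^{o}_{k}|$). The names `photometric_error` and `best_label` are hypothetical.

```python
import math


def apply_h(H, pt):
    """Map a pixel (x, y) through the 3x3 homography H."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def photometric_error(content, H, img_t, img_r):
    """Mean |I_r(H(p)) - I_t(p)| over the pixels p = (x, y) of one content."""
    height, width = len(img_r), len(img_r[0])
    total, count = 0.0, 0
    for x, y in content:
        u, v = apply_h(H, (x, y))
        xi, yi = math.floor(u + 0.5), math.floor(v + 0.5)  # nearest pixel
        if 0 <= xi < width and 0 <= yi < height:
            total += abs(img_r[yi][xi] - img_t[y][x])
            count += 1
    return total / count if count else float("inf")


def best_label(content, homographies, img_t, img_r):
    """Return (index of the lowest-error homography, list of all errors)."""
    errors = [photometric_error(content, H, img_t, img_r) for H in homographies]
    return min(range(len(errors)), key=errors.__getitem__), errors


# Toy data (assumption): the reference is the target shifted right by one pixel.
img_t = [[10, 20, 30, 0], [40, 50, 60, 0]]
img_r = [[0, 10, 20, 30], [0, 40, 50, 60]]
content = [(0, 0), (1, 0), (0, 1), (1, 1)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift_x = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
label, errors = best_label(content, [identity, shift_x], img_t, img_r)
```

In the toy data, the one-pixel shift reproduces the reference exactly, so it is selected over the identity.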

Labeling for the non-overlapping region. To smoothly extrapolate the warp from the overlapping region to the non-overlapping region, we linearize the multiple homographies in the overlapping region and utilize them to label the image contents in the non-overlapping region, which is inspired by [23]. Moreover, we use a similarity transformation to reduce the perspective distortion in the non-overlapping region. In particular, we use each subset of feature matches to calculate an individual similarity transformation and choose the one with the smallest rotation angle. Then, we uniformly sample $R_{1}$ anchor points $\{\mathbf{a}_{i}\}_{i=1}^{R_{1}}$ on the boundary of the overlapping region and $R_{2}$ anchor points $\{\mathbf{b}_{j}\}_{j=1}^{R_{2}}$ on the outermost boundary of the non-overlapping region. We then partition the non-overlapping region into a grid of cells and take the center of each cell as $\mathbf{c}_{k}$. The linearized homography for the cell centered at $\mathbf{c}_{k}$ is calculated as

$$\begin{aligned} H_{\mathbf{c}_{k}} &= \sum_{i=1}^{R_{1}}\alpha_{i}\left(H_{\mathbf{a}_{i}}+J_{H}(\mathbf{a}_{i})(\mathbf{c}_{k}-\mathbf{a}_{i})\right) \\ &+ \sum_{j=1}^{R_{2}}\alpha_{j}\left(\mathbf{S}+J_{\mathbf{S}}(\mathbf{b}_{j})(\mathbf{c}_{k}-\mathbf{b}_{j})\right) \end{aligned} \tag{6}$$

where $J_{H}(\mathbf{a}_{i})$ is the Jacobian of the homography $H_{\mathbf{a}_{i}}$ at the point $\mathbf{a}_{i}$, and $J_{\mathbf{S}}(\mathbf{b}_{j})$ is the Jacobian of the optimal similarity transformation $\mathbf{S}$ at the point $\mathbf{b}_{j}$. $\alpha_{i}$ ($\alpha_{j}$) is a function of $\mathbf{c}_{k}$ and $\mathbf{a}_{i}$ ($\mathbf{b}_{j}$), defined by Student’s t-weighting:

$$\alpha_{i}=\left(1+\frac{\|\mathbf{c}_{k}-\mathbf{a}_{i}\|^{2}}{\nu}\right)^{\frac{-(\nu+1)}{2}} \tag{7}$$

$$\alpha_{j}=\left(1+\frac{\|\mathbf{c}_{k}-\mathbf{b}_{j}\|^{2}}{\nu}\right)^{\frac{-(\nu+1)}{2}} \tag{8}$$

where $\nu$ controls the decaying rate of $\alpha_{i}$ ($\alpha_{j}$). Specifically, $\alpha_{i}$ assigns a higher weight to the cells in the neighborhood of the anchor point $\mathbf{a}_{i}$, making the linearized homography $H_{\mathbf{c}_{k}}$ similar to $H_{\mathbf{a}_{i}}$ near the overlapping region and gradually transition to $\mathbf{S}$ farther away. Thus, the multiple homographies can be smoothly extrapolated to the non-overlapping region. We calculate the linearized homographies for dense cells instead of the segmented contents in the non-overlapping region because the former generate smooth and matched boundaries between neighboring cells, whereas the segmented contents do not.
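The extrapolation of Eqs. 6 to 8 can be sketched as follows, interpreting each term $H_{\mathbf{a}_{i}}+J_{H}(\mathbf{a}_{i})(\mathbf{c}_{k}-\mathbf{a}_{i})$ as the first-order Taylor expansion of the warp around the anchor, evaluated at the cell center. One step is an assumption of this sketch and is not stated in Eq. 6: the weights are normalized to sum to one so that the result stays a valid position. The names `t_weight`, `linearize` and `extrapolate` are hypothetical.

```python
def t_weight(c, anchor, nu):
    """Student's t-weighting of Eqs. 7 and 8."""
    d2 = (c[0] - anchor[0]) ** 2 + (c[1] - anchor[1]) ** 2
    return (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)


def linearize(H, a, c):
    """First-order Taylor expansion of the warp H around anchor a, evaluated at c."""
    x, y = a
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    # Jacobian of the projective map (x, y) -> (u, v) at the anchor.
    J = [[(H[0][0] - u * H[2][0]) / w, (H[0][1] - u * H[2][1]) / w],
         [(H[1][0] - v * H[2][0]) / w, (H[1][1] - v * H[2][1]) / w]]
    dx, dy = c[0] - x, c[1] - y
    return (u + J[0][0] * dx + J[0][1] * dy,
            v + J[1][0] * dx + J[1][1] * dy)


def extrapolate(c, inner, outer, S, nu):
    """Warped position of cell center c.

    inner : list of (a_i, H_{a_i}) on the boundary of the overlapping region
    outer : list of anchors b_j on the outermost boundary
    S     : 3x3 similarity transformation
    """
    terms = [(t_weight(c, a, nu), linearize(H, a, c)) for a, H in inner]
    terms += [(t_weight(c, b, nu), linearize(S, b, c)) for b in outer]
    total = sum(wt for wt, _ in terms)  # normalization (assumption, see lead-in)
    return (sum(wt * p[0] for wt, p in terms) / total,
            sum(wt * p[1] for wt, p in terms) / total)


# Toy data (assumption): both the homography and S are a translation by (2, 3).
T = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [0.0, 0.0, 1.0]]
warped = extrapolate((5.0, 7.0), [((0.0, 0.0), T)], [(10.0, 0.0)], T, nu=1.0)
```

For an affine map the linearization is exact, so the toy example simply reproduces the translation. With a genuinely projective homography, the linearization removes the growth of perspective distortion away from the anchor.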

Figure 5: Different labeling strategies for the non-overlapping region. (a) Labeling strategy that replaces $\mathbf{S}$ with the global homography $H_{g}$. (b) Labeling strategy using only the first term of Eq. (6). (c) Labeling strategy that replaces dense cells with contents segmented by SAM. (d) Our final labeling strategy.

Recalling the visualization result in Fig. 2, we further demonstrate the multi-labeling result, where each color represents a label and corresponds to a homography. More labeling results are exhibited in Fig. 4, which contains several SAM results, labeling results in the overlapping region, and the colored photometric errors calculated by Eq. (5). The homography assigned to each content in the overlapping region yields very small errors. We also show the stitching results using different labeling strategies for the non-overlapping region in Fig. 5. Including the similarity transformation in the homography linearization for dense cells reduces the perspective distortions and achieves the best stitching quality.

3.3 Image Warping and Blending

The multi-homography fitting and multi-labeling provide multiple homographies for the corresponding contents $C_{k}^{o}$ in the overlapping region and grid cells $\mathbf{c}_{k}$ in the non-overlapping region. Finally, we warp the target image to align with the reference image and composite them via linear blending. Here we use both forward and backward mapping to generate the warped target and reference images.

Forward mapping can accurately map each labeled object to the corresponding object in the reference image. However, the mapped coordinates are generally non-integer (floating-point), which complicates rendering. We address this by combining forward mapping with backward mapping on each labeled region. Specifically, we use forward mapping to calculate the canvas and backward texture mapping for each labeled region to render the image.
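The interplay of the two mappings can be sketched as follows for a single homography and a grayscale image stored as a list of rows. This is a simplified illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, one homography is applied to the whole image rather than one per labeled region, and bilinear interpolation is an assumed choice for resolving the floating-point coordinates.

```python
import math


def apply_h(H, x, y):
    """Map (x, y) through a 3x3 homography given as nested lists."""
    s = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / s
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / s
    return u, v


def invert_3x3(M):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("homography is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def canvas_bounds(H, width, height):
    """Forward-map the four image corners to get the warped bounding box."""
    corners = [(0, 0), (width - 1, 0), (0, height - 1), (width - 1, height - 1)]
    pts = [apply_h(H, x, y) for x, y in corners]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return math.floor(min(xs)), math.floor(min(ys)), math.ceil(max(xs)), math.ceil(max(ys))


def bilinear(img, x, y):
    """Sample img at a float location; return None outside the image."""
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return None
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy


def warp_backward(img, H):
    """Forward mapping sizes the canvas; backward mapping fills every
    integer canvas pixel by sampling the source image."""
    h, w = len(img), len(img[0])
    x_min, y_min, x_max, y_max = canvas_bounds(H, w, h)
    H_inv = invert_3x3(H)
    out = []
    for v in range(y_min, y_max + 1):
        row = []
        for u in range(x_min, x_max + 1):
            x, y = apply_h(H_inv, u, v)
            row.append(bilinear(img, x, y))
        out.append(row)
    return out, (x_min, y_min)
```

Because every canvas pixel is visited exactly once and pulled from the source, the rendered image has no holes, which is the usual motivation for rendering with backward mapping.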

Besides, due to the parallax between foreground objects and the background scene, the input images may have different occlusion relations, which results in an “n-to-1” forward mapping issue. We deal with it by introducing an “error-buffer” algorithm. Specifically, for $n$ regions in the target image that are mapped to the same region in the reference image, we compute the $n$ average photometric errors and warp the region with the smallest error to align with the reference image. Fig. 6 shows an example of the occlusion issue. With the introduction of the error-buffer algorithm, the warped target image is well-rendered with the same occlusion relation as the reference image.
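One way to read the error-buffer idea is as an analogue of a depth buffer in which the stored quantity is a region's average photometric error instead of its depth. The sketch below follows that reading; it is an interpretation under stated assumptions, not the authors' code. The names are hypothetical, images are represented as dictionaries from pixel coordinates to grayscale intensities, and the absolute intensity difference stands in for the photometric error of Eq. (5).

```python
def mean_photometric_error(warped, reference):
    """Average absolute intensity difference between a warped region and
    the reference image, over the pixels they share."""
    diffs = [abs(val - reference[p]) for p, val in warped.items() if p in reference]
    return sum(diffs) / len(diffs) if diffs else float("inf")


def render_with_error_buffer(regions, reference):
    """regions: {region_id: {(x, y): intensity}}, already warped onto the canvas.

    When several regions land on the same canvas pixel (the n-to-1 case),
    the pixel is taken from the region with the smallest average error.
    Returns the rendered canvas and the id of the region owning each pixel.
    """
    canvas, err_buf, owner = {}, {}, {}
    for rid, warped in regions.items():
        err = mean_photometric_error(warped, reference)
        for p, val in warped.items():
            if p not in err_buf or err < err_buf[p]:
                canvas[p] = val
                err_buf[p] = err
                owner[p] = rid
    return canvas, owner


if __name__ == "__main__":
    reference = {(0, 0): 100, (1, 0): 100, (2, 0): 50}
    regions = {
        "foreground": {(0, 0): 98, (1, 0): 102},   # mean error 2
        "background": {(1, 0): 60, (2, 0): 52},    # mean error 21
    }
    canvas, owner = render_with_error_buffer(regions, reference)
    print(owner[(1, 0)])  # the contested pixel goes to the better-aligned region
```

The result does not depend on the order in which regions are processed, since a pixel is only overwritten by a region with a strictly smaller error.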

Figure 6: The occlusion issue in texture mapping. (a) Input images. (b) Warped image without error-buffer. (c) Warped image with error-buffer. Red arrows indicate the occlusion issue in warped images.

4 Experiments

Settings. We conduct comparative experiments of the proposed method on a variety of publicly available datasets with large parallax, including 35 image pairs from Parallax [35], 24 image pairs from SEAGULL [24] and 14 image pairs from MR [13]. For complete comparisons, the state-of-the-art methods, including single homography (Baseline), APAP [34], SPHP [4], ANAP [23], GSP [6], REW [19], SPW [22], TFA [20], LPC [15] and UDIS++ [30], are involved. The model parameters of existing methods are set as in the original papers. To highlight the alignment performance, all the warping results are composited using linear blending. In the following experiments, feature points are detected and matched using SIFT [27]. For our model parameters, we set $\lambda=20$, $\beta=10$, $\gamma=200$ for multi-homography fitting, and $\nu$ is set to 5 in Student’s t-weighting. The proposed method typically takes from 5 to 10 seconds (including SAM, SIFT feature detection and matching, multi-homography warping and final linear blending) with a 3.6GHz CPU and 16GB RAM to stitch two images with a resolution of $1000\times 750$.

4.1 Quantitative Comparison

To evaluate the alignment performance of these warping methods, we use three Image Quality Assessment (IQA) metrics: PSNR, SSIM [32], and LPIPS [37]. The three metrics are calculated on the overlapping regions of the warped images. The average metrics on the three datasets are shown in Table 1. Note that for a few image pairs, some methods fail to produce any meaningful result; we exclude these cases from the average metric calculation of the corresponding method and instead count them as Failure Cases (# FC). Our method consistently achieves the best alignment performance by a large margin on the three datasets according to the involved metrics.
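As an illustration of restricting a metric to the overlapping region, the following sketch computes PSNR only over pixels flagged by an overlap mask. It assumes grayscale images stored as lists of rows, a binary mask of the same size, and a peak value of 255; the function name is hypothetical, and SSIM and LPIPS are not covered here.

```python
import math


def psnr_on_overlap(reference, warped, mask, max_val=255.0):
    """PSNR = 10 * log10(max_val^2 / MSE), with the MSE taken only over
    pixels where mask is truthy (the overlapping region)."""
    sq_err = 0.0
    count = 0
    for r_row, w_row, m_row in zip(reference, warped, mask):
        for r, w, m in zip(r_row, w_row, m_row):
            if m:
                sq_err += (r - w) ** 2
                count += 1
    if count == 0:
        raise ValueError("overlap mask is empty")
    mse = sq_err / count
    if mse == 0:
        return float("inf")  # identical images inside the overlap
    return 10.0 * math.log10(max_val ** 2 / mse)


if __name__ == "__main__":
    reference = [[10, 20], [30, 40]]
    warped = [[10, 25], [0, 40]]
    mask = [[1, 1], [0, 1]]  # the badly aligned pixel lies outside the overlap
    print(psnr_on_overlap(reference, warped, mask))
```

Pixels outside the mask have no influence on the score, so large differences in the non-overlapping region do not penalize a warp.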

Table 1: Quantitative comparisons between SOTA warping methods on different challenging datasets. The best is marked in bold and the second best is underlined.

| Method | Parallax [35] PSNR ↑ | Parallax [35] SSIM ↑ | Parallax [35] LPIPS ↓ | SEAGULL [24] PSNR ↑ | SEAGULL [24] SSIM ↑ | SEAGULL [24] LPIPS ↓ | MR [13] PSNR ↑ | MR [13] SSIM ↑ | MR [13] LPIPS ↓ | # FC |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 15.37 | 0.614 | 0.297 | 15.93 | 0.592 | 0.313 | 14.57 | 0.565 | 0.325 | 3 |
| APAP [34] | 16.84 | 0.659 | 0.237 | 16.63 | 0.613 | 0.282 | 15.89 | 0.608 | 0.282 | 0 |
| SPHP [4] | 15.77 | 0.662 | 0.317 | 15.90 | 0.640 | 0.339 | 15.02 | 0.664 | 0.364 | 1 |
| ANAP [23] | 16.87 | 0.666 | 0.227 | 17.00 | 0.640 | 0.247 | 16.11 | 0.619 | 0.269 | 1 |
| GSP [6] | 17.30 | 0.698 | 0.215 | 17.34 | 0.676 | 0.228 | 15.28 | 0.621 | 0.274 | 0 |
| REW [19] | 17.37 | 0.700 | 0.233 | 16.78 | 0.650 | 0.290 | 14.78 | 0.587 | 0.339 | 0 |
| SPW [22] | 16.33 | 0.642 | 0.250 | 16.49 | 0.602 | 0.285 | 15.49 | 0.574 | 0.291 | 0 |
| TFA [20] | 16.00 | 0.637 | 0.319 | 16.63 | 0.650 | 0.293 | 14.45 | 0.567 | 0.366 | 1 |
| LPC [15] | 16.33 | 0.634 | 0.256 | 16.03 | 0.588 | 0.299 | 14.29 | 0.514 | 0.342 | 0 |
| UDIS++ [30] | 15.64 | 0.606 | 0.269 | 16.09 | 0.577 | 0.294 | 15.02 | 0.542 | 0.311 | 1 |
| Ours | 19.19 | 0.752 | 0.192 | 18.69 | 0.713 | 0.226 | 18.55 | 0.736 | 0.197 | 0 |

4.2 Panorama Comparison

We then compare the final panoramas against these SOTA warping methods in Fig. 7, which contains five challenging image pairs. Large parallax occurs at the boundary between the foreground objects and background scenes. Existing warping methods fail to align them due to the smoothness constraint, which results in undesirable ghosting artifacts in the blended panoramas. In contrast, by using the multiple homographies fitted on the contents segmented by SAM, our warping method aligns them simultaneously. Due to limited space, we only show part of the results. All the input image pairs and comparison results on the three datasets are provided in the supplementary material.

Figure 7: Comparisons of the image stitching results obtained by our method with those of the SOTA warping methods. Only part of the warping methods are included due to limited space. (Best viewed in color and zoomed in.)

4.3 Ablation Study

To validate the effectiveness of each module in our method, we conduct an ablation study on the three datasets and report the results in Table 2. “Ours+$\mathcal{H}_{0}$” ablates the multi-homography fitting step: the initial homography models $\mathcal{H}_{0}$ are instead used to label and align the segmented contents; “Ours+$\mathcal{N}_{\triangle}$” denotes that we only use Delaunay triangulation to define the neighborhood system in the multi-homography fitting; “Ours+$\mathcal{S}$” represents that we replace SAM with superpixel segmentation [1] to align different image contents; and “Ours-eb” means that we do not use the error-buffer algorithm in the image warping. Experiments show that including SAM to define the neighborhood system and using our error-buffer algorithm improve the alignment performance, benefiting from the accurate semantic information. Although “Ours+$\mathcal{H}_{0}$” slightly improves some metrics on the three datasets, our final model is more stable and visually pleasing. The first three ablated methods all have several failure cases when run on the entire datasets. We show one visual comparison in Fig. 8: using $\mathcal{H}_{0}$ may generate unnatural distortions in the non-overlapping region, and using $\mathcal{S}$ (superpixel) may produce ghosting artifacts due to inaccurate segmentation on object boundaries.

Table 2: Ablation study. The ablation methods have failing image pairs that are not included in the evaluation. The best is marked in bold and the second best is underlined.

| Method | Parallax [35] PSNR ↑ | Parallax [35] SSIM ↑ | Parallax [35] LPIPS ↓ | SEAGULL [24] PSNR ↑ | SEAGULL [24] SSIM ↑ | SEAGULL [24] LPIPS ↓ | MR [13] PSNR ↑ | MR [13] SSIM ↑ | MR [13] LPIPS ↓ | # FC |
|---|---|---|---|---|---|---|---|---|---|---|
| Ours+$\mathcal{H}_{0}$ | 19.24 | 0.755 | 0.199 | 18.99 | 0.716 | 0.227 | 18.36 | 0.741 | 0.204 | 3 |
| Ours+$\mathcal{N}_{\triangle}$ | 17.44 | 0.696 | 0.227 | 17.84 | 0.670 | 0.244 | 15.96 | 0.633 | 0.269 | 4 |
| Ours+$\mathcal{S}$ | 18.50 | 0.730 | 0.211 | 18.02 | 0.697 | 0.244 | 17.27 | 0.682 | 0.244 | 3 |
| Ours-eb | 18.75 | 0.746 | 0.202 | 18.29 | 0.702 | 0.240 | 18.11 | 0.732 | 0.202 | 0 |
| Ours | 19.19 | 0.752 | 0.192 | 18.69 | 0.713 | 0.226 | 18.55 | 0.736 | 0.197 | 0 |

Figure 8: Visual comparisons between different ablation methods. Red arrows indicate the ghosting artifacts and distortions. The PSNR, SSIM and LPIPS metrics for “Ours+$\mathcal{H}_{0}$”, “Ours+$\mathcal{N}_{\triangle}$”, “Ours+$\mathcal{S}$”, “Ours-eb” and “Ours” are (18.09, 11.65, 16.50, 16.86, 17.76), (0.779, 0.493, 0.701, 0.742, 0.756), and (0.178, 0.432, 0.230, 0.202, 0.177), respectively. (Best viewed in color and zoomed in.)

Figure 9: Failure example of our method. (a) Input images. (b) The SAM result of the target image. (c) Feature matches after our multi-homography fitting algorithm. (d) Our final result. (Best viewed in color and zoomed in.)

4.4 Discussion

Our warping method calculates multiple homographies from subsets of the feature points. Note that the total number of feature points should not be too small; otherwise, the fitted homographies may fail to align the segmented contents. Additionally, the homography linearization step only constrains $\mathcal{C}^{1}$ smoothness of the extrapolation from the overlapping to the non-overlapping region. Fig. 9 shows a failure example of our method. The input images lack a sufficient number of feature matches, so our method generates only a single homography and the final result has severe misalignments. Using SAM to help construct region matches and line feature matches could improve the alignment accuracy. Besides, adopting higher-order continuity and other structure-preserving constraints would make the warping results more visually pleasing. We leave these as future work.

5 Conclusions

In this paper, we propose a multi-homography warping method for handling images with large parallax. We first segment the target image into semantic contents via SAM, and then partition the feature points into multiple subsets via the multi-homography fitting algorithm, where each subset fits a homography that aligns part of the image contents well. Subsequently, we label each semantic content in the overlapping region with the best-fitting homography, i.e., the one with the lowest photometric error. The best-fitting homographies are then linearized and extrapolated to the non-overlapping region to obtain a smooth warp. Experimental results demonstrate that our proposed method accurately aligns challenging images with large parallax and yields significantly better performance than the state-of-the-art image warping methods.

References

  • Achanta et al. [2012] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
  • Brown and Lowe [2007] Matthew Brown and David G Lowe. Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis., 74(1):59–73, 2007.
  • Chang et al. [2014a] C. H. Chang, C. J. Chen, and Y. Y. Chuang. Spatially-varying image warps for scene alignment. In ICPR, pages 64–69, 2014a.
  • Chang et al. [2014b] Che-Han Chang, Yuuki Sato, and Yung-Yu Chuang. Shape-preserving half-projective warps for image stitching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3254–3261, 2014b.
  • Chen et al. [2021] Yinqi Chen, Huicheng Zheng, Yiyan Ma, and Zhiwei Yan. Image stitching based on angle-consistent warping. Pattern Recognition, 117:107993, 2021.
  • Chen and Chuang [2016] Yu-Sheng Chen and Yung-Yu Chuang. Natural image stitching with the global similarity prior. In Proc. 14th Eur. Conf. Comput. Vision, pages 186–201, 2016.
  • Delong et al. [2012] Andrew Delong, Anton Osokin, Hossam N Isack, and Yuri Boykov. Fast approximate energy minimization with label costs. International journal of computer vision, 96:1–27, 2012.
  • DeTone et al. [2016] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Deep image homography estimation. arXiv preprint arXiv:1606.03798, 2016.
  • Du et al. [2022] Peng Du, Jifeng Ning, Jiguang Cui, Shaoli Huang, Xinchao Wang, and Jiaxin Wang. Geometric structure preserving warp for natural image stitching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3688–3696, 2022.
  • Fischler and Bolles [1981] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
  • Gao et al. [2011] Junhong Gao, Seon Joo Kim, and Michael S Brown. Constructing image panoramas using dual-homography warping. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 49–56, 2011.
  • Gao et al. [2013] Junhong Gao, Yu Li, Tat-Jun Chin, and Michael S Brown. Seam-driven image stitching. Eurographics, pages 45–48, 2013.
  • Herrmann et al. [2018] Charles Herrmann, Chen Wang, Richard Strong Bowen, Emil Keyder, Michael Krainin, Ce Liu, and Ramin Zabih. Robust image stitching with multiple registrations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 53–67, 2018.
  • Isack and Boykov [2012] Hossam Isack and Yuri Boykov. Energy-based geometric multi-model fitting. International journal of computer vision, 97(2):123–147, 2012.
  • Jia et al. [2021] Qi Jia, ZhengJun Li, Xin Fan, Haotian Zhao, Shiyu Teng, Xinchen Ye, and Longin Jan Latecki. Leveraging line-point consistence to preserve structures for wide parallax image stitching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 12186–12195, 2021.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • Lee and Sim [2020] Kyu-Yul Lee and Jae-Young Sim. Warping residual based image stitching for large parallax. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Li et al. [2021] Aocheng Li, Jie Guo, and Yanwen Guo. Image stitching based on semantic planar region consensus. IEEE Transactions on Image Processing, 30:5545–5558, 2021.
  • Li et al. [2018a] Jing Li, Zhengming Wang, Shiming Lai, Yongping Zhai, and Maojun Zhang. Parallax-tolerant image stitching based on robust elastic warping. IEEE Trans. Multimedia, 20(7):1672–1687, 2018a.
  • Li et al. [2019] Jing Li, Baosong Deng, Rongfu Tang, Zhengming Wang, and Ye Yan. Local-adaptive image alignment based on triangular facet approximation. IEEE Transactions on Image Processing, 29:2356–2369, 2019.
  • Li et al. [2018b] N. Li, Y. Xu, and C. Wang. Quasi-homography warps in image stitching. IEEE Trans. Multimedia, 20(6):1365–1375, 2018b.
  • Liao and Li [2020] Tianli Liao and Nan Li. Single-perspective warps in natural image stitching. IEEE Transactions on Image Processing, 29(1):724–735, 2020.
  • Lin et al. [2015] Chung-Ching Lin, Sharathchandra U Pankanti, Karthikeyan Natesan Ramamurthy, and Aleksandr Y Aravkin. Adaptive as-natural-as-possible image stitching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1155–1163, 2015.
  • Lin et al. [2016] Kaimo Lin, Nianjuan Jiang, Loong-Fah Cheong, Minh Do, and Jiangbo Lu. Seagull: Seam-guided local alignment for parallax-tolerant image stitching. In Proc. 14th Eur. Conf. Comput. Vision, pages 370–385, 2016.
  • Lin et al. [2022] Mingyuan Lin, Tangbo Liu, Ying Li, Xinpeng Miao, and Chu He. Image stitching by disparity-guided multi-plane alignment. Signal Processing, 197:108534, 2022.
  • Liu and Chai [2019] Shiguang Liu and Qingpeng Chai. Shape-optimizing and illumination-smoothing image stitching. IEEE Transactions on Multimedia, 21(3):690–703, 2019.
  • Lowe [2004] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • Nie et al. [2020] Lang Nie, Chunyu Lin, Kang Liao, Meiqin Liu, and Yao Zhao. A view-free image stitching network based on global homography. Journal of Visual Communication and Image Representation, 73:102950, 2020.
  • Nie et al. [2021] Lang Nie, Chunyu Lin, Kang Liao, Shuaicheng Liu, and Yao Zhao. Unsupervised deep image stitching: Reconstructing stitched features to images. IEEE Transactions on Image Processing, 30:6184–6197, 2021.
  • Nie et al. [2023] Lang Nie, Chunyu Lin, Kang Liao, Shuaicheng Liu, and Yao Zhao. Parallax-tolerant unsupervised deep image stitching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7399–7408, 2023.
  • Szeliski [2006] Richard Szeliski. Image alignment and stitching: A tutorial. Foundations and Trends® in Computer Graphics and Vision, 2(1):1–104, 2006.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Xiang et al. [2018] Tian-Zhu Xiang, Gui-Song Xia, Xiang Bai, and Liangpei Zhang. Image stitching by line-guided local warping with global similarity constraint. Pattern Recognition, 83:481–497, 2018.
  • Zaragoza et al. [2014] Julio Zaragoza, Tat-Jun Chin, Quoc-Huy Tran, Michael S Brown, and David Suter. As-projective-as-possible image stitching with moving DLT. IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1285–1298, 2014.
  • Zhang and Liu [2014] Fan Zhang and Feng Liu. Parallax-tolerant image stitching. In Proc. IEEE Conf. Comput. Vision Pattern Recognit., pages 3262–3269, 2014.
  • Zhang et al. [2016] Guofeng Zhang, Yi He, Weifeng Chen, Jiaya Jia, and Hujun Bao. Multi-viewpoint panorama construction with wide-baseline images. IEEE Trans. Image Processing, 25(7):3099–3111, 2016.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • Zhang et al. [2021] Zhihao Zhang, Xianqiang Yang, and Chao Xu. Natural image stitching with layered warping constraint. IEEE Transactions on Multimedia, pages 1–1, 2021.
  • Zheng et al. [2019] J. Zheng, Y. Wang, H. Wang, B. Li, and H. M. Hu. A novel projective-consistent plane based image stitching method. IEEE Transactions on Multimedia, 21(10):2561–2575, 2019.