Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping

Tianli Liao, Lei Li and Guangen Liu
College of Information Science and Engineering, Henan University of Technology
{tianli.liao, leili, lgendd_99}@haut.edu.cn

Ce Wang
Hong Kong University of Science and Technology
[email protected]

Nan Li
School of Mathematical Sciences, Shenzhen University
[email protected]

Abstract

Large parallax between images is an intractable issue in image stitching. Various warping-based methods have been proposed to address it, yet the results remain unsatisfactory. In this paper, we propose a novel image stitching method using multi-homography warping guided by image segmentation. Specifically, we leverage the Segment Anything Model to segment the target image into numerous contents and partition the feature points into multiple subsets via an energy-based multi-homography fitting algorithm. The subsets of feature points are used to calculate the corresponding homographies. For each segmented content in the overlapping region, we select its best-fitting homography, i.e., the one with the lowest photometric error. For each segmented content in the non-overlapping region, we calculate a weighted combination of the linearized homographies. Finally, the target image is warped via the best-fitting homographies to align with the reference image, and the final panorama is generated via linear blending. Comprehensive experimental results on public datasets demonstrate that our method provides the best alignment accuracy by a large margin compared with state-of-the-art methods. The source code is available at https://github.com/tlliao/multi-homo-warp.

1 Introduction

Image stitching is a widely used technology in various computer vision applications and consumer photography. Given captured images with a narrow field-of-view (FoV), the aim is to construct a panorama of the scene with a wider FoV. Although several academic and commercial stitching tools have been published, e.g., Adobe Photoshop Photomerge, Microsoft Image Composite Editor, and Autostitch [2], image stitching with large parallax is still an intractable problem. Various stitching methods have been proposed to handle it using different strategies.

Most of the existing methods follow a similar two-step pipeline [31]: image warping and alignment, followed by image composition and blending. Generally, the crucial image warping step determines the stitching quality, and the homography is a simple and commonly used warping model representing the planar transformation between the input scenes. However, it assumes either a planar scene or purely rotational camera motion. Violating these assumptions leads to undesirable parallax between the images, which a single homography cannot handle. As the case in Fig. 1(d) shows, severe ghosting artifacts are generated in the stitching results.

To alleviate the artifacts caused by parallax, adaptive warping models have been proposed using either spatially-varying warps [11, 4, 17, 21, 20, 23, 26, 34, 39, 25] or mesh-based warps [3, 6, 36, 33, 22, 15, 18, 38, 5, 9]. In particular, the spatially-varying warps aim to create a spatially-varying motion field in the image domain. The warps provide better alignment in the overlapping region and can be smoothly extrapolated to the non-overlapping region. Mesh-based warps partition the images into regular meshes and formulate the warping model as an energy minimization over the mesh deformation. Different energy functions are then designed to maintain the smoothness of the warps, improving the alignment accuracy and preventing distortions. Designing a deep learning framework to address image stitching [30] is another way to handle parallax, but it still relies on a smoothness constraint. For images with larger parallax, warps with smoothness constraints easily fail to align the foreground objects and background scenes simultaneously, as shown in Fig. 1(e-k). Intuitively, aligning the objects and scenes separately by different homographies and fusing them will improve the alignment. Motivated by the remarkable segmentation performance of the Segment Anything Model (SAM) [16], as shown in Fig. 1(b), we propose a multi-homography warping method which achieves the best alignment accuracy, as shown in Fig. 1(c).

In this paper, we propose a stitching method via multi-homography warping for images with large parallax. Firstly, we segment the target image into numerous contents via SAM and detect feature points in the input images. The feature points are then partitioned into multiple subsets via the energy-based multi-homography fitting algorithm to generate multiple homographies. Then, for each segmented content in the overlapping region, we select (label) its best-fitting homography with the lowest photometric error. For each content in the non-overlapping region, we calculate (label) a weighted combination of the linearized homographies. Finally, the target image is warped based on the homographies and aligned with the reference image. Experimental results show that our method can accurately align different image contents with large parallax and outperforms the state-of-the-art warping methods by a large margin. The main contributions of our work are three-fold:

• We propose a stitching method that integrates SAM into the warping model generation and achieves the best alignment accuracy.

• The multi-homography fitting algorithm we designed provides better homography models than iterative RANSAC, especially for images with large parallax. Our algorithm is more stable and more robust to outliers in the feature matches.

• We propose a parameter-free forward and backward warping strategy to warp different contents in the target image via multiple homographies and generate well-aligned results with correct occlusions (holes).

The rest of this paper is organized as follows. Section 2 introduces the related work of image stitching. Section 3 gives a detailed description of our multi-homography warping method, including multi-homography fitting, multi-labeling, and image warping and blending. Section 4 demonstrates the experimental results. Section 5 concludes the paper.

2 Related Work

2.1 Spatially-varying Warps

Given a set of feature matches, some warping methods adopt piece-wise homographies as adaptive warping models, where every local homography is determined via a weighting scheme. Gao et al. [11] proposed a dual-homography (DH) warping model, where two representative homographies for two dominant planes (distant plane and ground plane) are estimated. The spatially-varying motion field is then calculated by interpolating the two homographies as basis motions. Zheng et al. [39] further developed DH to handle overlapping scenes consisting of multiple dominant planes.

Zaragoza et al. [34] proposed an as-projective-as-possible (APAP) warping model to handle parallax. It generates a smooth spatially-varying motion field by fitting weighted correspondences in the feature matches instead of interpolating basis motions. For each quad mesh, it estimates a location-dependent homography by a weighted direct linear transform. Li et al. [20] and Lee et al. [17] improved the APAP model by estimating a weighted homography for each triangle and superpixel, respectively. Li et al. [19] proposed a robust elastic warping model to handle images with parallax. They formulate the image warping as a thin plate spline model with a simple radial basis function.

Other spatially-varying warps [4, 23, 21, 26] focus on alleviating the distortions in the non-overlapping regions between input images; they are often combined with APAP or other warping models to handle the parallax issues. The above warping methods can generate much better alignment results. However, due to the smoothness constraint, they often fail to align objects with large parallax and abrupt depth changes.

Lee et al. [17] introduced the warping residual and estimated a discontinuous warp with “holes” to handle such issues. Lin et al. [25] proposed a method using a disparity map and multiple homographies to distinguish one background plane and multiple foreground objects and align them separately. Both methods adopt iterative RANSAC to generate multiple homographies. However, RANSAC can only provide a sub-optimal model to fit the inliers, the effect of the outliers is hard to eliminate, and the random process may result in unstable homography models. Besides, the performance of Lee’s method [17] depends on the accuracy of the warping residual and the superpixel segmentation. The “holes” are produced via a hand-crafted selection strategy, which depends on the parameter setting. Lin’s method [25] cannot handle the case where the foreground objects lack feature matches. With the remarkable performance of SAM, our method can provide a more robust and accurate discontinuous warp.

2.2 Mesh-based Warps

Another strategy to handle parallax is to represent the image as quad meshes and model the image warping as mesh deformation. Chen et al. [6] proposed a mesh-based method with a global similarity prior to address the distortions of unnatural rotation and scaling. They calculate the alignment term based on APAP-generated vertex correspondences instead of feature matches, since the former are much more uniformly distributed. Li et al. [18] and Zhang et al. [38] further developed the global similarity prior to handle a single image with multiple dominant regions.

Zhang et al. [36] proposed to enforce a straightness constraint in the mesh deformation. Liao and Li [22] proposed a single-perspective warp to mitigate the perspective distortions in the non-overlapping region. Jia et al. [15] further leveraged the line-point consistency to provide more accurate alignment and preserve both local and global lines. Chen et al. [5] proposed an angle-consistent warping that integrates angle features of key points into homography estimation and mesh deformation. Du et al. [9] proposed a structure-preserving warping model to preserve the large-scale structures reflected by straight lines or curves.

The mesh-based warps focus on exploring different geometrical or content constraints to smoothly extrapolate the warping from the overlapping region to the non-overlapping region. For inputs with large parallax, they easily fail.

2.3 Other Warps

To handle large parallax, a seam-driven strategy has been introduced, which combines warping models with the subsequent composition step to generate visually pleasing results instead of the most geometrically accurate alignment. Gao et al. [12] proposed to generate multiple homography candidates and choose the homography that yields the best seam-cutting result. Zhang et al. [35] and Lin et al. [24] further developed the strategy by introducing mesh-based warping candidates which provide better alignment. Nevertheless, the warping candidates still need to align certain local regions well such that a plausible seam can be found.

Some learning-based methods [28, 29] applied deep homography estimation [8] to stitch images. Nie et al. [30] further introduced an unsupervised stitching method which proposed a flexible warp to model the image alignment from global homography to local thin-plate spline motion. However, the flexible warp with smoothness constraint still fails to handle images with large parallax.

3 Method

In this section, we present our image stitching method via multi-homography warping, which mainly includes three steps: multi-homography fitting, multi-labeling for image contents, and image warping and blending. The pipeline of our method is shown in Fig. 2.

3.1 Multi-homography Fitting

Given the target image and the reference image, we first segment the target image into numerous contents $\{C_{k}\}_{k=1}^{M}$ via SAM [16]. Then we detect the feature matches and remove the outliers via RANSAC [10]. Note that the feature matches cannot be fitted with a single homography due to the parallax. For this purpose, we employ the fundamental matrix to select the inliers based on the observation that a correct feature match $(\mathbf{p},\mathbf{q})$ should satisfy:

$$\|\tilde{\mathbf{p}}^{T}F_{*}\tilde{\mathbf{q}}\|<\epsilon,$$

where $\tilde{\mathbf{p}}$ ($\tilde{\mathbf{q}}$) is the homogeneous coordinate of the feature point $\mathbf{p}$ ($\mathbf{q}$), and $F_{*}$ is obtained via solving the following equation using the direct linear transform (DLT) method:

$$F_{*}=\arg\min_{F}\sum_{(\mathbf{p},\mathbf{q})}\|\tilde{\mathbf{p}}^{T}F\tilde{\mathbf{q}}\|. \tag{1}$$
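As an illustration of the inlier test above, the following minimal sketch evaluates the epipolar residual $\|\tilde{\mathbf{p}}^{T}F\tilde{\mathbf{q}}\|$ for each match and keeps the matches below the threshold. The function names `epipolar_residual` and `select_inliers` are hypothetical, the fundamental matrix is assumed to be already estimated, and the threshold value is left to the caller since it is not specified here.

```python
def epipolar_residual(p, q, F):
    """Return |p~^T F q~| for pixel coordinates p = (x, y) and q = (x, y)."""
    ph = (p[0], p[1], 1.0)  # homogeneous coordinate of p
    qh = (q[0], q[1], 1.0)  # homogeneous coordinate of q
    # F q~ is a 3-vector; its dot product with p~ gives the residual.
    Fq = [sum(F[r][c] * qh[c] for c in range(3)) for r in range(3)]
    return abs(sum(ph[r] * Fq[r] for r in range(3)))


def select_inliers(matches, F, eps):
    """Keep the matches (p, q) whose epipolar residual is below eps."""
    return [(p, q) for p, q in matches if epipolar_residual(p, q, F) < eps]
```

For a camera that only translates horizontally, the residual reduces to the vertical offset between the matched points, which gives an easy sanity check of the sketch.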

The operation results in a set of feature matches $\{(\mathbf{p}_{i},\mathbf{q}_{i})\}_{i=1}^{N}$ between the two input images. Then we employ the multi-model fitting method [14] to partition the set of feature matches into several subsets, each of which fits a single homography. In particular, we aim to minimize the following energy function:

$$E(\mathcal{H})=\sum_{i=1}^{N}\mathrm{D}(\mathbf{p}_{i},H_{\mathbf{p}_{i}})+\lambda\sum_{(\mathbf{p}_{i},\mathbf{p}_{j})\in\mathcal{N}}\delta(H_{\mathbf{p}_{i}}\neq H_{\mathbf{p}_{j}})+\beta|\mathcal{H}|, \tag{2}$$

where $\mathcal{H}=\{H_{\mathbf{p}_{i}}\ |\ \mathbf{p}_{i}\in P\}$ is the assignment of homography models to feature points $P=\{\mathbf{p}_{i}\}_{i=1}^{N}$ . We next concretely explain the involved terms.

The first term $\mathrm{D}(\cdot,\cdot)$ measures the error of the feature point $\mathbf{p}_{i}$ fitted by the given homography $H_{\mathbf{p}_{i}}$ , which is defined as the symmetric transfer error (STE):

$$\mathrm{D}(\mathbf{p}_{i},H_{\mathbf{p}_{i}})=\|H_{\mathbf{p}_{i}}\mathbf{p}_{i}-\mathbf{q}_{i}\|+\|H^{-1}_{\mathbf{p}_{i}}\mathbf{q}_{i}-\mathbf{p}_{i}\|. \tag{3}$$

The homography model $H_{\mathbf{p}_{i}}$ is computed by minimizing the non-linear STE error of all the feature points fitted by $H_{\mathbf{p}_{i}}$ using the Levenberg-Marquardt method. The initial solution for the non-linear minimization is found using the DLT method.
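The symmetric transfer error of Eq. 3 can be sketched in a few lines. This is only an illustrative implementation under the assumption that homographies are stored as 3×3 nested lists acting on pixel coordinates; the helper names `apply_h`, `invert_3x3` and `symmetric_transfer_error` are hypothetical, and the Levenberg-Marquardt refinement itself is not shown.

```python
import math


def apply_h(H, pt):
    """Map the pixel pt = (x, y) through the 3x3 homography H."""
    x, y = pt
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def invert_3x3(M):
    """Invert a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[value / det for value in row] for row in adj]


def symmetric_transfer_error(p, q, H):
    """Eq. 3: forward error ||H p - q|| plus backward error ||H^-1 q - p||."""
    fx, fy = apply_h(H, p)
    bx, by = apply_h(invert_3x3(H), q)
    return math.hypot(fx - q[0], fy - q[1]) + math.hypot(bx - p[0], by - p[1])
```

Because both directions are penalized, a match that is displaced by one pixel under a pure translation contributes an error of two pixels.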

It is worth noting that fundamental-matrix-based RANSAC cannot remove all the outliers in the presence of image noise. To properly separate the outliers from the inliers in the multi-homography fitting step, we define an extra “outlier” model $\emptyset$, which has a constant fidelity measure $\mathrm{D}(\mathbf{p}_{i},\emptyset)=\gamma$ for all points $\mathbf{p}_{i}\in P$.

The second term measures the smoothness between neighboring feature points fitted by different homography models. As the segmented contents of SAM provide abundant semantic information, we use them to construct a better neighborhood system of the feature points in the target image. Note that any two feature points in the same content are regarded as parallax-free and can be fitted via the same homography, so the smoothness term in Eq. 2 should be zero for them. To be specific, we define the neighborhood $\mathcal{N}$ as:

$$\mathcal{N}=\{(\mathbf{p}_{i},\mathbf{p}_{j})\ |\ \mathbf{p}_{i}\in C_{k}^{o},\ \mathbf{p}_{j}\in C_{k}^{o},\ \overline{\mathbf{p}_{i}\mathbf{p}_{j}}\in\triangle_{m}\}_{k,m}, \tag{4}$$

where $C_{k}^{o}$ denotes the segmented content in the overlapping region $O$, and $\triangle_{m}$ is a triangle generated by the Delaunay triangulation of the feature points. The indicator function $\delta(\cdot)$ in Eq. 2 is 1 if the condition inside the parentheses is true and 0 otherwise.
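The neighborhood of Eq. 4 can be illustrated as follows. The sketch assumes that the Delaunay triangles are already available as triples of feature-point indices (computed elsewhere) and that each feature point carries the index of the segmented content it falls in, or `None` if it lies outside every content of the overlapping region. The name `build_neighborhood` and this data layout are assumptions made for illustration.

```python
def build_neighborhood(triangles, content_of_point):
    """Return the sorted edges (i, j), i < j, that satisfy Eq. 4.

    triangles        -- iterable of index triples from a Delaunay triangulation
    content_of_point -- content index per feature point, or None if unassigned
    """
    edges = set()
    for i, j, k in triangles:
        for a, b in ((i, j), (j, k), (i, k)):
            ca, cb = content_of_point[a], content_of_point[b]
            # Keep a triangle edge only if both endpoints share one content.
            if ca is not None and ca == cb:
                edges.add((min(a, b), max(a, b)))
    return sorted(edges)
```

Edges that cross a content boundary are dropped, so the smoothness term never penalizes assigning different homographies to different contents.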

The third term penalizes the number of homography models, which keeps the set of fitted models compact.
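To make the three terms concrete, the sketch below evaluates the energy of Eq. 2 for a given labeling. It is not the optimizer: the data costs are assumed to be precomputed (e.g., as symmetric transfer errors), the outlier model is encoded as the label `None`, and, as an assumption of this sketch, the outlier model is not counted in $|\mathcal{H}|$. The name `fitting_energy` is hypothetical.

```python
OUTLIER = None  # label standing for the extra outlier model


def fitting_energy(labels, data_cost, neighborhood, lam, beta, gamma):
    """Evaluate Eq. 2 for one assignment of models to feature points.

    labels       -- labels[i] is the model index of point i, or OUTLIER
    data_cost    -- data_cost[i][l] is D(p_i, H_l), precomputed by the caller
    neighborhood -- iterable of index pairs (i, j) as in Eq. 4
    """
    # Data term: fitting error, or the constant gamma for outliers.
    data = sum(gamma if l is OUTLIER else data_cost[i][l]
               for i, l in enumerate(labels))
    # Smoothness term: lam for every neighboring pair with different labels.
    smooth = lam * sum(1 for i, j in neighborhood if labels[i] != labels[j])
    # Label cost: beta per homography in use (outlier model not counted here).
    n_models = len({l for l in labels if l is not OUTLIER})
    return data + smooth + beta * n_models
```

For example, four points with labels `[0, 0, 1, None]` under $\lambda=20$, $\beta=10$, $\gamma=200$ pay the data costs of the three fitted points plus 200 for the outlier, 20 for each label change along the neighborhood, and 20 for the two models in use.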

For the efficient optimization of the energy function in Eq. 2, the extension of the $\alpha$-expansion algorithm [7] is used. For an accurate starting state, we first initialize the homography models $\mathcal{H}_{0}$ by iteratively performing homography-based RANSAC until the number of remaining feature matches is smaller than a predefined threshold (50 in our experiments).

For ease of understanding, we here visualize one multi-homography fitting result on the test case [35] in Fig. 2. Ten subsets of feature matches are partitioned, each fitting a homography. Among them, the first homography successfully aligns the background buildings. For the foreground scene, the second and the sixth fitted homographies align the tower and the red statue, respectively. To verify the effectiveness of SAM-based contents, we compare the fitting results calculated by defining the neighborhood system w/o and w/ SAM-based contents. As shown in Fig. 3, the neighborhood system involving SAM can produce a more accurate partition of the feature points. Sec. 4.3 will provide a detailed discussion to prove its effectiveness.

3.2 Multi-labeling for Image Contents

Then, we use the obtained multiple homographies to label (align) the segmented contents of the target image.

Labeling for the overlapping region. We first calculate a global homography warp $H_{g}$ using all the feature matches to construct the overlapping region. Then, each content $C^{o}_{k}$ in the overlapping region is labeled with the homography that aligns it to the reference image with the lowest photometric error. The photometric error of the content $C^{o}_{k}$ aligned by $H_{i}$ is calculated as:

$$e(C_{k}^{o},H_{i})=\frac{1}{|C^{o}_{k}|}\sum_{\mathbf{p}\in C^{o}_{k}}\|I_{r}(H_{i}(\mathbf{p}))-I_{t}(\mathbf{p})\|,\ H_{i}\in\mathcal{H}, \tag{5}$$

where $H_{i}$ maps the pixel coordinate $\mathbf{p}$ in the target image to the coordinate $H_{i}(\mathbf{p})$ in the reference image.
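The labeling rule of Eq. 5 can be sketched as below. The sketch makes several simplifying assumptions that are not stated in the text: images are grayscale and stored as nested lists indexed as `image[y][x]`, the reference image is sampled with nearest-neighbor rounding, and pixels that map outside the reference image are skipped, so the average is taken over the valid pixels only. The names `photometric_error` and `best_homography` are hypothetical.

```python
def apply_h(H, pt):
    """Map the pixel pt = (x, y) through the 3x3 homography H."""
    x, y = pt
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def photometric_error(content, H, target, reference):
    """Mean absolute intensity difference of one content under H (Eq. 5)."""
    height, width = len(reference), len(reference[0])
    total, count = 0.0, 0
    for x, y in content:
        u, v = apply_h(H, (x, y))
        ui, vi = int(round(u)), int(round(v))  # nearest-neighbor sampling
        if 0 <= ui < width and 0 <= vi < height:
            total += abs(reference[vi][ui] - target[y][x])
            count += 1
    # Assumption: a content with no valid pixel gets an infinite error.
    return total / count if count else float("inf")


def best_homography(content, homographies, target, reference):
    """Return the index of the homography with the lowest photometric error."""
    errors = [photometric_error(content, H, target, reference)
              for H in homographies]
    return min(range(len(errors)), key=errors.__getitem__)
```

With a reference image that is the target shifted by one pixel, the one-pixel translation wins over the identity, as expected.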

Labeling for the non-overlapping region. To smoothly extrapolate the warp from the overlapping region to the non-overlapping region, we linearize the multiple homographies in the overlapping region and utilize them to label the image contents in the non-overlapping region, which is inspired by [23]. Moreover, we use a similarity transformation to reduce the perspective distortion in the non-overlapping region. In particular, we use each subset of feature matches to calculate an individual similarity transformation and choose the one with the smallest rotation angle. Then, we uniformly sample $R_{1}$ anchor points $\{\mathbf{a}_{i}\}_{i=1}^{R_{1}}$ on the boundary of the overlapping region and $R_{2}$ anchor points $\{\mathbf{b}_{j}\}_{j=1}^{R_{2}}$ on the outermost boundary of the non-overlapping region. We then partition the non-overlapping region into a grid of cells and denote the center of each cell by $\mathbf{c}_{k}$. The linearized homography for the cell centered at $\mathbf{c}_{k}$ is calculated as

$$\begin{aligned}H_{\mathbf{c}_{k}}&=\sum_{i=1}^{R_{1}}\alpha_{i}\left(H_{\mathbf{a}_{i}}+J_{H}(\mathbf{a}_{i})(\mathbf{c}_{k}-\mathbf{a}_{i})\right)\\&\quad+\sum_{j=1}^{R_{2}}\alpha_{j}\left(\mathbf{S}+J_{\mathbf{S}}(\mathbf{b}_{j})(\mathbf{c}_{k}-\mathbf{b}_{j})\right),\end{aligned} \tag{6}$$

where $J_{H}(\mathbf{a}_{i})$ is the Jacobian of the homography $H_{\mathbf{a}_{i}}$ at the point $\mathbf{a}_{i}$ , $J_{\mathbf{S}}(\mathbf{b}_{j})$ is the Jacobian of the optimal similarity transformation $\mathbf{S}$ at the point $\mathbf{b}_{j}$ . $\alpha_{i}$ ( $\alpha_{j}$ ) is a function of $\mathbf{c}_{k}$ and $\mathbf{a}_{i}$ ( $\mathbf{b}_{j}$ ), defined by Student’s t-weighting:

$$\alpha_{i}=\left(1+\frac{\|\mathbf{c}_{k}-\mathbf{a}_{i}\|^{2}}{\nu}\right)^{\frac{-(\nu+1)}{2}}, \tag{7}$$

$$\alpha_{j}=\left(1+\frac{\|\mathbf{c}_{k}-\mathbf{b}_{j}\|^{2}}{\nu}\right)^{\frac{-(\nu+1)}{2}}, \tag{8}$$

where $\nu$ controls the decay rate of $\alpha_{i}$ ($\alpha_{j}$). Specifically, $\alpha_{i}$ assigns a higher weight to the cells in the neighborhood of the anchor point $\mathbf{a}_{i}$, making the linearized homography $H_{\mathbf{c}_{k}}$ similar to $H_{\mathbf{a}_{i}}$ near the overlapping region and gradually transforming it to $\mathbf{S}$. Thus, the multiple homographies can be smoothly extrapolated to the non-overlapping region. We calculate the linearized homographies for dense cells instead of the segmented contents in the non-overlapping region because the former generate smooth and matched boundaries between neighboring cells, while the segmented contents do not.
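The weighting and linearization can be sketched as follows. This sketch reads Eq. 6 as a first-order Taylor expansion of each anchor's warp, evaluated at the cell center, and it normalizes the Student's t weights so that they sum to one; both the interpretation and the normalization are assumptions made for illustration. The similarity transformation $\mathbf{S}$ can be passed in the same way as a homography, since it is a special case of one. All function names are hypothetical.

```python
def student_t_weight(c, anchor, nu):
    """Student's t weight of Eqs. 7-8 between cell center c and an anchor."""
    d2 = (c[0] - anchor[0]) ** 2 + (c[1] - anchor[1]) ** 2
    return (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)


def linearized_map(H, a, c):
    """First-order expansion of the warp H around anchor a, evaluated at c."""
    x, y = a
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Jacobian of (u / w, v / w) with respect to (x, y) at the anchor.
    j00 = (H[0][0] * w - u * H[2][0]) / w ** 2
    j01 = (H[0][1] * w - u * H[2][1]) / w ** 2
    j10 = (H[1][0] * w - v * H[2][0]) / w ** 2
    j11 = (H[1][1] * w - v * H[2][1]) / w ** 2
    dx, dy = c[0] - a[0], c[1] - a[1]
    return (u / w + j00 * dx + j01 * dy, v / w + j10 * dx + j11 * dy)


def blended_position(c, anchors, nu):
    """Weighted combination of linearized warps; anchors is a list of (H, a)."""
    weights = [student_t_weight(c, a, nu) for _, a in anchors]
    total = sum(weights)  # assumption: weights are normalized to sum to one
    mapped = [linearized_map(H, a, c) for H, a in anchors]
    x = sum(wt * m[0] for wt, m in zip(weights, mapped)) / total
    y = sum(wt * m[1] for wt, m in zip(weights, mapped)) / total
    return (x, y)
```

For an affine warp the Jacobian is constant, so the linearization is exact at any cell center, which provides a simple consistency check.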

Recalling the visualization result in Fig. 2, we further demonstrate the multi-labeling result, where each color represents a label and corresponds to a homography. More labeling results are also exhibited in Fig. 4, which contains several SAM results, labeling results in the overlapping region, and the colored photometric errors calculated by Eq. (5). The labeling homography for each content in the overlapping region provides very small errors. We also show the stitching results using different labeling strategies for the non-overlapping region in Fig. 5. Including similarity transformation into homography linearization for dense cells reduces the perspective distortions and achieves the best stitching quality.

3.3 Image Warping and Blending

The multi-homography fitting and multi-labeling provide multiple homographies for corresponding contents $C_{k}^{o}$ in the overlapping region and grid cells $\mathbf{c}_{k}$ in the non-overlapping region. Finally, we warp the target image to align with the reference image and composite them via linear blending. Here we use both forward and backward mapping to generate the warped target and reference images.

The forward mapping is capable of accurately mapping each labeled object to the corresponding object in the reference image. However, it produces non-integer (floating-point) coordinates in the mapping process. We address this by combining it with backward mapping on each labeled region. Specifically, we use forward mapping to calculate the canvas and backward texture mapping for each labeled region to render the image.
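The combination of the two mappings can be sketched for a single region as follows. The sketch assumes a grayscale region stored as nested lists, takes the inverse homography as an argument instead of computing it, uses the forward-mapped corners to size the canvas (valid when the projective denominator keeps one sign over the region), and samples with nearest-neighbor rounding. The name `warp_region` is hypothetical.

```python
import math


def apply_h(H, pt):
    """Map the pixel pt = (x, y) through the 3x3 homography H."""
    x, y = pt
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def warp_region(image, H, H_inv):
    """Warp image by H; return (canvas, (x0, y0)) with None marking holes."""
    h, w = len(image), len(image[0])
    # Forward mapping of the corners determines the canvas extent.
    corners = [apply_h(H, c)
               for c in ((0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1))]
    x0 = math.floor(min(c[0] for c in corners))
    x1 = math.ceil(max(c[0] for c in corners))
    y0 = math.floor(min(c[1] for c in corners))
    y1 = math.ceil(max(c[1] for c in corners))
    canvas = [[None] * (x1 - x0 + 1) for _ in range(y1 - y0 + 1)]
    # Backward mapping: every canvas pixel looks up its source pixel.
    for cy in range(y0, y1 + 1):
        for cx in range(x0, x1 + 1):
            sx, sy = apply_h(H_inv, (cx, cy))
            xi, yi = int(round(sx)), int(round(sy))
            if 0 <= xi < w and 0 <= yi < h:
                canvas[cy - y0][cx - x0] = image[yi][xi]
    return canvas, (x0, y0)
```

Because every canvas pixel is filled from an integer source location, the rendered region has no gaps caused by non-integer forward coordinates.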

Besides, due to the parallax between foreground objects and scenes, the input images may have different occlusion relations, which results in the “n-to-1” forward mapping issue. We deal with it by introducing an “error-buffer” algorithm. Specifically, for $n$ regions in the target image that are mapped to the same region in the reference image, we compute the $n$ average photometric errors and warp the region with the smallest error to align with the reference image. Fig. 6 shows an example of the occlusion issue. With the introduction of the error-buffer algorithm, the warped target image is well-rendered with the same occlusion relation as the reference image.
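One way to picture the error-buffer idea is by analogy with a depth buffer, where the average photometric error of a region plays the role of depth. The sketch below resolves conflicts per canvas pixel, which is a simplification chosen for illustration rather than a statement of the exact procedure; the dictionary layout and the name `error_buffer_composite` are assumptions.

```python
def error_buffer_composite(regions):
    """Composite warped regions, keeping the lowest-error region per pixel.

    regions -- list of dicts with keys
               'error'  : average photometric error of the region
               'pixels' : {(x, y): value} in canvas coordinates
    """
    buffer = {}  # (x, y) -> (error, value) of the current winner
    for region in regions:
        err = region["error"]
        for xy, value in region["pixels"].items():
            # Overwrite only if this region aligns better than the current one.
            if xy not in buffer or err < buffer[xy][0]:
                buffer[xy] = (err, value)
    return {xy: value for xy, (_, value) in buffer.items()}
```

When two regions land on the same canvas pixel, the one with the smaller error is rendered and the other is treated as occluded there.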

4 Experiments

Settings. We conduct comparative experiments of the proposed method on a variety of publicly available datasets with large parallax, including 35 image pairs from Parallax [35], 24 image pairs from SEAGULL [24] and 14 image pairs from MR [13]. For complete comparisons, the state-of-the-art methods, including single homography (Baseline), APAP [34], SPHP [4], ANAP [23], GSP [6], REW [19], SPW [22], TFA [20], LPC [15] and UDIS++ [30], are involved. The model parameters of existing methods are set as in the original papers. To highlight the alignment performance, all the warping results are composited using linear blending. In the following experiments, feature points are detected and matched using SIFT [27]. For our model parameters, we set $\lambda=20,\beta=10,\gamma=200$ for multi-homography fitting, and $\nu$ is set to 5 in Student’s t-weighting. The proposed method typically takes 5 to 10 seconds (including SAM, SIFT feature detection and matching, multi-homography warping and final linear blending) with a 3.6GHz CPU and 16GB RAM to stitch two images with a resolution of 1000 $\times$ 750.

4.1 Quantitative Comparison

To evaluate the alignment performance of these warping methods, we use three Image Quality Assessment (IQA) metrics, PSNR, SSIM [32], and LPIPS [37]. The three metrics are calculated on the overlapping regions of the warped images. The average metrics on the three datasets are shown in Table 1. Note that for a few image pairs, some methods fail to produce any meaningful results. We do not include these cases in the average metric calculation of those methods, and instead report them as the number of Failure Cases (# FC). Our method consistently achieves the best alignment performance by a large margin on the three datasets according to the involved metrics.
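As a reference for how such a score is restricted to the overlapping region, the following sketch computes PSNR over a binary mask. It assumes 8-bit grayscale images stored as nested lists and a mask of the same size; the name `masked_psnr` is hypothetical, and the evaluation code behind Table 1 may differ in details such as color handling.

```python
import math


def masked_psnr(image_a, image_b, mask, max_value=255.0):
    """PSNR between two images over the pixels where mask is truthy."""
    squared_error, count = 0.0, 0
    for row_a, row_b, row_m in zip(image_a, image_b, mask):
        for va, vb, m in zip(row_a, row_b, row_m):
            if m:
                squared_error += (va - vb) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    mse = squared_error / count
    if mse == 0:
        return float("inf")  # identical images inside the mask
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Restricting the mask to pixels covered by both warped images avoids penalizing regions where only one image contributes.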

Table 1: Quantitative comparisons between SOTA warping methods on different challenging datasets. The best is marked in bold and the second best is in underline.

	Parallax [35]			SEAGULL [24]			MR [13]
Method	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	# FC
Baseline	15.37	0.614	0.297	15.93	0.592	0.313	14.57	0.565	0.325	3
APAP [34]	16.84	0.659	0.237	16.63	0.613	0.282	15.89	0.608	0.282	0
SPHP [4]	15.77	0.662	0.317	15.90	0.640	0.339	15.02	0.664	0.364	1
ANAP [23]	16.87	0.666	0.227	17.00	0.640	0.247	16.11	0.619	0.269	1
GSP [6]	17.30	0.698	0.215	17.34	0.676	0.228	15.28	0.621	0.274	0
REW [19]	17.37	0.700	0.233	16.78	0.650	0.290	14.78	0.587	0.339	0
SPW [22]	16.33	0.642	0.250	16.49	0.602	0.285	15.49	0.574	0.291	0
TFA [20]	16.00	0.637	0.319	16.63	0.650	0.293	14.45	0.567	0.366	1
LPC [15]	16.33	0.634	0.256	16.03	0.588	0.299	14.29	0.514	0.342	0
UDIS++ [30]	15.64	0.606	0.269	16.09	0.577	0.294	15.02	0.542	0.311	1
Ours	19.19	0.752	0.192	18.69	0.713	0.226	18.55	0.736	0.197	0

4.2 Panorama Comparison

We then compare the final panoramas with those of the SOTA warping methods in Fig. 7, which contains five challenging image pairs. Large parallax occurs at the boundary between the foreground objects and background scenes. Existing warping methods fail to align them due to the smoothness constraint, which results in undesirable ghosting artifacts in the blended panoramas. In contrast, by utilizing the multiple homographies fitted on the contents segmented by SAM, our warping method aligns them simultaneously. Due to the limited space, we only show parts of the results. All the input image pairs and comparison results on the three datasets are provided in the supplementary material.

4.3 Ablation Study

To validate the effectiveness of each module in our method, we conduct an ablation study on the three datasets and report the results in Table 2. Among the variants, “Ours+ $\mathcal{H}_{0}$ ” ablates the multi-homography fitting step, and the initial homography models $\mathcal{H}_{0}$ are used instead to label and align the segmented contents; “Ours+ $\mathcal{N}_{\triangle}$ ” denotes that we only use Delaunay triangulation to define the neighborhood system in the multi-homography fitting; “Ours+ $\mathcal{S}$ ” represents that we replace SAM by superpixel segmentation [1] to align different image contents; and “Ours-eb” means that we do not use the error-buffer algorithm in the image warping. Experiments show that including SAM to define the neighborhood system and using our error-buffer algorithm improve the alignment performance, benefiting from the accurate semantic information. Although “Ours+ $\mathcal{H}_{0}$ ” slightly improves some metrics on the three datasets, our final model is more stable and visually pleasing. The first three ablated methods all have several failure cases when run on the entire datasets. We show one visual comparison in Fig. 8: using $\mathcal{H}_{0}$ may generate unnatural distortions in the non-overlapping region, and using $\mathcal{S}$ (superpixel) may produce ghosting artifacts due to the inaccurate segmentation at the object boundary.

Table 2: Ablation study. The ablation methods have failing image pairs that are not included in the evaluation. The best is marked in bold and the second best is in underline.

	Parallax [35]			SEAGULL [24]			MR [13]
Method	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	# FC
Ours+ $\mathcal{H}_{0}$	19.24	0.755	0.199	18.99	0.716	0.227	18.36	0.741	0.204	3
Ours+ $\mathcal{N}_{\triangle}$	17.44	0.696	0.227	17.84	0.670	0.244	15.96	0.633	0.269	4
Ours+ $\mathcal{S}$	18.50	0.730	0.211	18.02	0.697	0.244	17.27	0.682	0.244	3
Ours $-$ eb	18.75	0.746	0.202	18.29	0.702	0.240	18.11	0.732	0.202	0
Ours	19.19	0.752	0.192	18.69	0.713	0.226	18.55	0.736	0.197	0

4.4 Discussion

Our warping method calculates multiple homographies via the subsets of feature points. Note that the total number of feature points should not be too small; otherwise, the fitted homographies may fail to align the segmented contents. Additionally, the homography linearization step only constrains the $\mathcal{C}^{1}$ smoothness of the extrapolation from the overlapping to the non-overlapping region. Fig. 9 shows a failure example of our method. It lacks a sufficient number of feature matches between the input images, such that our method only generates one single homography and the final result has severe misalignments. Using SAM to help construct region matches and line feature matches can improve the alignment accuracy. Besides, adopting higher-order continuity and other structure-preserving constraints would make the warping results more visually pleasing. We leave these as future work.

5 Conclusions

In this paper, we propose a multi-homography warping method for handling images with large parallax. We first segment the target image into semantic contents via SAM, and then partition the feature points into multiple subsets via the multi-homography fitting algorithm, where each subset fits a homography and aligns part of the image contents well. Subsequently, we label each semantic content in the overlapping region using the best-fitting homography with the lowest photometric error. The best-fitting homographies are then linearized and extrapolated to the non-overlapping region to obtain a smooth warp. Experimental results demonstrate that our proposed method accurately aligns challenging images with large parallax, and yields significantly better performance than the state-of-the-art image warping methods.

References

Achanta et al. [2012] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. Slic superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
Brown and Lowe [2007] Matthew Brown and David G Lowe. Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis., 74(1):59–73, 2007.
Chang et al. [2014a] C. H. Chang, C. J. Chen, and Y. Y. Chuang. Spatially-varying image warps for scene alignment. In ICPR, pages 64–69, 2014a.
Chang et al. [2014b] Che-Han Chang, Yuuki Sato, and Yung-Yu Chuang. Shape-preserving half-projective warps for image stitching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3254–3261, 2014b.
Chen et al. [2021] Yinqi Chen, Huicheng Zheng, Yiyan Ma, and Zhiwei Yan. Image stitching based on angle-consistent warping. Pattern Recognition, 117:107993, 2021.
Chen and Chuang [2016] Yu-Sheng Chen and Yung-Yu Chuang. Natural image stitching with the global similarity prior. In Proc. 14th Eur. Conf. Comput. Vision, pages 186–201, 2016.
Delong et al. [2012] Andrew Delong, Anton Osokin, Hossam N Isack, and Yuri Boykov. Fast approximate energy minimization with label costs. International journal of computer vision, 96:1–27, 2012.
DeTone et al. [2016] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Deep image homography estimation. arXiv preprint arXiv:1606.03798, 2016.
Du et al. [2022] Peng Du, Jifeng Ning, Jiguang Cui, Shaoli Huang, Xinchao Wang, and Jiaxin Wang. Geometric structure preserving warp for natural image stitching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3688–3696, 2022.
Fischler and Bolles [1981] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
Gao et al. [2011] Junhong Gao, Seon Joo Kim, and Michael S Brown. Constructing image panoramas using dual-homography warping. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 49–56, 2011.
Gao et al. [2013] Junhong Gao, Yu Li, Tat-Jun Chin, and Michael S Brown. Seam-driven image stitching. Eurographics, pages 45–48, 2013.
Herrmann et al. [2018] Charles Herrmann, Chen Wang, Richard Strong Bowen, Emil Keyder, Michael Krainin, Ce Liu, and Ramin Zabih. Robust image stitching with multiple registrations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 53–67, 2018.
Isack and Boykov [2012] Hossam Isack and Yuri Boykov. Energy-based geometric multi-model fitting. International Journal of Computer Vision, 97(2):123–147, 2012.
Jia et al. [2021] Qi Jia, ZhengJun Li, Xin Fan, Haotian Zhao, Shiyu Teng, Xinchen Ye, and Longin Jan Latecki. Leveraging line-point consistence to preserve structures for wide parallax image stitching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 12186–12195, 2021.
Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
Lee and Sim [2020] Kyu-Yul Lee and Jae-Young Sim. Warping residual based image stitching for large parallax. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Li et al. [2021] Aocheng Li, Jie Guo, and Yanwen Guo. Image stitching based on semantic planar region consensus. IEEE Transactions on Image Processing, 30:5545–5558, 2021.
Li et al. [2018a] Jing Li, Zhengming Wang, Shiming Lai, Yongping Zhai, and Maojun Zhang. Parallax-tolerant image stitching based on robust elastic warping. IEEE Trans. Multimedia, 20(7):1672–1687, 2018a.
Li et al. [2019] Jing Li, Baosong Deng, Rongfu Tang, Zhengming Wang, and Ye Yan. Local-adaptive image alignment based on triangular facet approximation. IEEE Transactions on Image Processing, 29:2356–2369, 2019.
Li et al. [2018b] N. Li, Y. Xu, and C. Wang. Quasi-homography warps in image stitching. IEEE Trans. Multimedia, 20(6):1365–1375, 2018b.
Liao and Li [2020] Tianli Liao and Nan Li. Single-perspective warps in natural image stitching. IEEE Transactions on Image Processing, 29(1):724–735, 2020.
Lin et al. [2015] Chung-Ching Lin, Sharathchandra U Pankanti, Karthikeyan Natesan Ramamurthy, and Aleksandr Y Aravkin. Adaptive as-natural-as-possible image stitching. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1155–1163, 2015.
Lin et al. [2016] Kaimo Lin, Nianjuan Jiang, Loong-Fah Cheong, Minh Do, and Jiangbo Lu. Seagull: Seam-guided local alignment for parallax-tolerant image stitching. In Proc. 14th Eur. Conf. Comput. Vision, pages 370–385, 2016.
Lin et al. [2022] Mingyuan Lin, Tangbo Liu, Ying Li, Xinpeng Miao, and Chu He. Image stitching by disparity-guided multi-plane alignment. Signal Processing, 197:108534, 2022.
Liu and Chai [2019] Shiguang Liu and Qingpeng Chai. Shape-optimizing and illumination-smoothing image stitching. IEEE Transactions on Multimedia, 21(3):690–703, 2019.
Lowe [2004] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
Nie et al. [2020] Lang Nie, Chunyu Lin, Kang Liao, Meiqin Liu, and Yao Zhao. A view-free image stitching network based on global homography. Journal of Visual Communication and Image Representation, 73:102950, 2020.
Nie et al. [2021] Lang Nie, Chunyu Lin, Kang Liao, Shuaicheng Liu, and Yao Zhao. Unsupervised deep image stitching: Reconstructing stitched features to images. IEEE Transactions on Image Processing, 30:6184–6197, 2021.
Nie et al. [2023] Lang Nie, Chunyu Lin, Kang Liao, Shuaicheng Liu, and Yao Zhao. Parallax-tolerant unsupervised deep image stitching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7399–7408, 2023.
Szeliski [2006] Richard Szeliski. Image alignment and stitching: A tutorial. Foundations and Trends® in Computer Graphics and Vision, 2(1):1–104, 2006.
Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
Xiang et al. [2018] Tian-Zhu Xiang, Gui-Song Xia, Xiang Bai, and Liangpei Zhang. Image stitching by line-guided local warping with global similarity constraint. Pattern Recognition, 83:481–497, 2018.
Zaragoza et al. [2014] Julio Zaragoza, Tat-Jun Chin, Quoc-Huy Tran, Michael S Brown, and David Suter. As-projective-as-possible image stitching with moving DLT. IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1285–1298, 2014.
Zhang and Liu [2014] Fan Zhang and Feng Liu. Parallax-tolerant image stitching. In Proc. IEEE Conf. Comput. Vision Pattern Recognit., pages 3262–3269, 2014.
Zhang et al. [2016] Guofeng Zhang, Yi He, Weifeng Chen, Jiaya Jia, and Hujun Bao. Multi-viewpoint panorama construction with wide-baseline images. IEEE Trans. Image Processing, 25(7):3099–3111, 2016.
Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
Zhang et al. [2021] Zhihao Zhang, Xianqiang Yang, and Chao Xu. Natural image stitching with layered warping constraint. IEEE Transactions on Multimedia, pages 1–1, 2021.
Zheng et al. [2019] J. Zheng, Y. Wang, H. Wang, B. Li, and H. M. Hu. A novel projective-consistent plane based image stitching method. IEEE Transactions on Multimedia, 21(10):2561–2575, 2019.