LightStereo: Channel Boost Is All Your Need for Efficient 2D Cost Aggregation

Xianda Guo1, Chenming Zhang2,3, Dujun Nie4,5, Wenzhao Zheng6,
Youmin Zhang7,8, Long Chen2,3,4
1 School of Computer Science, Wuhan University
2 Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
    3 Waytous     4 Institute of Automation, Chinese Academy of Sciences    5 Metoak
    6 University of California, Berkeley    7 University of Bologna     8 Rock Universe
Abstract

We present LightStereo, a cutting-edge stereo-matching network crafted to accelerate the matching process. Departing from conventional methodologies that rely on aggregating computationally intensive 4D costs, LightStereo adopts the 3D cost volume as a lightweight alternative. While similar approaches have been explored previously, our breakthrough lies in enhancing performance through a dedicated focus on the channel dimension of the 3D cost volume, where the distribution of matching costs is encapsulated. Our exhaustive exploration has yielded plenty of strategies to amplify the capacity of the pivotal dimension, ensuring both precision and efficiency. We compare the proposed LightStereo with existing state-of-the-art methods across various benchmarks, which demonstrate its superior performance in speed, accuracy, and resource utilization. LightStereo achieves a competitive EPE metric in the SceneFlow datasets while demanding a minimum of only 22 GFLOPs, with an inference time of just 17 ms. Our comprehensive analysis reveals the effect of 2D cost aggregation for stereo matching, paving the way for real-world applications of efficient stereo systems. Code will be available at https://github.com/XiandaGuo/OpenStereo.

Figure 1: Performance vs. Time on the SceneFlow (left) and KITTI15 (right) datasets.
Footnote 1: These authors contributed equally to this work.
Footnote 2: Corresponding authors: [email protected]

1 Introduction

Stereo matching is a pivotal task in computer vision, which aims to ascertain correspondences between points in stereo image pairs to compute depth information. This process underpins numerous applications, including autonomous driving, robotic navigation, and augmented reality. Despite substantial advancements, achieving real-time stereo matching without sacrificing accuracy or efficiency remains a formidable challenge, especially on resource-constrained platforms.

Recent developments gcnet ; psmnet2018 ; gwcnet2019 ; ganet2019 ; dsmnet2020 ; gu2020cascade ; leastereo ; wang2021pvstereo ; song2021adastereo ; Crestereo ; liu2022local ; acvnet ; nie2019multi ; guo2023openstereo have primarily focused on leveraging deep learning techniques for accurate stereo matching. Most methods rely on 3D CNNs for cost aggregation. The pioneering work is GCNet gcnet , which jointly learns geometry and context by employing 3D CNNs for cost aggregation. Because the disparity dimension is retained in the constructed 4D cost volume, the network can perform exhaustive aggregation for accurate matching. Subsequent works psmnet2018 ; gwcnet2019 adopt this paradigm and make notable advances in performance. To reduce the memory and computation costs of 3D cost aggregation, GANet ganet2019 proposes semi-global guided aggregation and local guided aggregation modules to replace 3D CNNs. LEAStereo leastereo uses neural architecture search to automatically discover efficient and effective 3D cost aggregation architectures for stereo matching. CoEx bangunharcana2021coex employs guided cost volume excitation to enable real-time stereo matching with enhanced accuracy. However, the use of 3D CNNs for cost aggregation in CoEx makes deployment on edge devices challenging because of their computational demands. Another line of work applies iterative refinement at multiple levels raftstereo ; xu2023iterative ; guo2023openstereo ; SelectiveStereo to refine the disparity prediction on a constructed all-pairs 3D cost volume. Nevertheless, the overall runtime of these iterative methods exceeds 100ms on a custom GPU.

There are also efforts wang2020fadnet ; wang2021fadnet++ ; shamsafar2022mobilestereonet ; xu2020aanet ; xu2023cgi focusing on lightweight design for stereo matching. AANet xu2020aanet constructs a 3D cost volume by correlating the left and right images and introduces intra-scale and cross-scale 2D cost aggregation modules to enhance the efficiency and accuracy of cost aggregation. MobilenetStereo-2D shamsafar2022mobilestereonet introduces MobileNet mijwil2023mobilenetv1 ; sandler2018mobilenetv2 blocks to reduce the cost, but its accuracy is far from satisfactory (EPE 1.14). Overall, these methods based on 2D cost aggregation perform poorly in terms of accuracy. Cost aggregation is critical to both accuracy and efficiency in stereo matching, yet existing methods require a compromise between accuracy and speed. This paper poses the question: Is it possible to design a lightweight 2D encoder-decoder aggregation network that achieves precise disparity estimation?

From the perspective of cost aggregation in stereo matching, focusing on the disparity channel dimension offers a clear advantage: it allows more direct modeling of the disparities between corresponding image points, which is crucial for accurate disparity estimation. By focusing on this dimension, the network can more effectively capture and assimilate the information needed for robust stereo matching. In this paper, we propose LightStereo as a positive answer to the question above. We explore 2D cost aggregation for stereo matching and leverage inverted residual blocks to enhance both accuracy and computational efficiency. The model is specifically designed to address the challenges of real-time stereo vision applications, reducing computational demands without compromising the quality of disparity estimation. Specifically, we use inverted residual blocks for 2D cost aggregation, which focus on the disparity channel dimension of the 3D cost volume rather than on the height and width dimensions. In inverted residual blocks, the expansion phase increases the channel dimension, allowing the network to learn richer features at a reduced computational cost before compressing them back. In addition, inspired by the effectiveness of large-kernel and strip convolutions in image segmentation peng2017large ; hou2020strip ; guo2022segnext , we propose a Multi-Scale Convolutional Attention (MSCA) module that enhances cost aggregation by extracting features from the left image. By leveraging multi-scale image features to excite the channel dimension of the cost volume, we use semantic information inherent in the images (such as object-level semantic details) to guide the cost aggregation process. When encountering discontinuities in disparity, the network halts propagation.

Our main contributions are as follows:

  • We propose LightStereo, which achieves a competitive EPE on the SceneFlow sceneflow dataset while requiring as few as 22 GFLOPs, with an inference time of 17 ms.

  • We propose inverted residual blocks for 2D cost aggregation in stereo matching.

  • We propose the MSCA module for enhancing cost aggregation by extracting features from the left image.

  • We verify the effectiveness of LightStereo, which achieves SOTA performance on the SceneFlow sceneflow and KITTI kitti2012 ; kitti2015 benchmarks among lightweight stereo-matching methods.

2 Related Work

Stereo matching, which predicts disparities (depth) from stereo images, can be classified into conventional and deep-learning-based methods. For both families, researchers have spent the past few decades searching for the best trade-off between accuracy and speed.

2.1 Conventional Stereo Matching

Stereo algorithms usually include the following four steps scharstein2002taxonomy : matching cost computation, cost aggregation, disparity computation/optimization, and disparity refinement. For the correspondence problem in cost aggregation, local methods use only the gray level, color, gradient, and other information within a certain neighborhood to calculate the matching cost, which keeps the computational complexity low hosni2009local ; bleyer2011patchmatch ; hosni2012fast . In contrast, global approaches use pixel-based matching costs to search for disparity assignments that minimize an energy function over the entire stereo pair terzopoulos1986regularization , and related research mainly focuses on the minimization procedure, such as Graph Cut boykov2001fast , Markov Random Fields yamaguchi2014efficient , and Dynamic Programming ohta1985stereo . Combining the two approaches, SGM hirschmuller2005accurate retains the global framework but replaces the two-dimensional minimization with a more efficient one-dimensional path aggregation. SGM thereby significantly improves efficiency while remaining competitive with global algorithms in accuracy.

2.2 Deep-learning-based Stereo Matching

To deal with problems such as specular surfaces, ambiguous regions, repetitive patterns, occlusions, and discontinuities, many modern studies use deep learning approaches to replace some or even all steps in the traditional matching process. These deep-learning-based stereo-matching algorithms can be categorized into two groups: accuracy-focused and real-time-focused.

Accuracy-Focused Stereo Matching. Many contemporary stereo-matching methods are dedicated to enhancing accuracy through various techniques and optimizations. Among these, end-to-end networks based on 3D convolutions have significantly improved the precision of disparity estimation psmnet2018 ; gwcnet2019 ; leastereo ; acvnet ; xu2023iterative ; guo2023openstereo . PSMNet psmnet2018 adopts an architecture that leverages spatial pyramid pooling and 3D convolutional neural networks to learn disparity estimation from stereo images. GwcNet gwcnet2019 constructs the cost volume using group-wise correlation: the left and right features are divided into groups along the channel dimension, and correlation maps are computed within each group to generate multiple matching cost proposals. LEAStereo leastereo employs a hierarchical neural architecture search (NAS) framework to enhance deep stereo matching performance. IGEV xu2023iterative constructs a unified geometry encoding volume that integrates geometry, contextual cues, and local matching details. OpenStereo guo2023openstereo conducted a comprehensive benchmark with a focus on practical applicability and introduced StereoBase, which further raises the performance ceiling of stereo matching. To further improve stereo matching methods based on iterative optimization, Selective-IGEV SelectiveStereo proposes the Selective Recurrent Unit to help integrate disparity information across frequencies, minimizing information loss during the iterative steps. In addition, ganetADL improves the loss function by introducing ADL, an adaptive multi-modal cross-entropy loss, to guide the network in learning diverse pixel distribution patterns.

Real-Time-Focused Stereo Matching. To facilitate the real-time deployment of stereo-matching algorithms, many strategies have been proposed. StereoNet khamis2018stereonet uses color inputs to guide hierarchical refinement and can recover high-frequency details. DeepPruner deeppruner2019 proposes a PatchMatch module with learnable parameters to save memory and computation by gradually pruning the disparity space to be searched for each pixel. AnyNet anynet2019 combines a 2D image convolutional network with a 3D cost tensor convolutional network and can output a disparity estimate at any time by progressively refining the resolution of the disparity map through up-sampling. FADNet wang2020fadnet leverages efficient 2D correlation layers with residual structures to perform multi-scale predictions. Based on a learned bilateral grid, BGNet bgnet designs an edge-preserving cost volume up-sampling module, allowing computationally expensive operations such as 3D convolution to be performed at low resolution. CoEx bangunharcana2021coex uses weights guided by image features to excite the cost volume channels so that the 3D CNN extracts relevant geometric features. Building upon StereoNet khamis2018stereonet , MobileStereoNet shamsafar2022mobilestereonet introduces two stereo vision models suitable for resource-constrained devices, achieving substantial reductions in both parameters and computational operations. However, existing stereo-matching methods still have considerable room for improvement in terms of computational efficiency and parameter scale.

After carefully considering and testing various network structures, we propose a lightweight network, called LightStereo. This model innovatively explores 2D cost aggregation for stereo matching, leveraging advanced techniques to enhance both accuracy and computational efficiency.

Figure 2: The diagram above illustrates the architecture of LightStereo. MSCA refers to Multi-Scale Convolutional Attention module.

3 Method

When deploying stereo matching on edge devices, constructing a 4D cost volume and using 3D CNNs for cost aggregation is extremely inefficient. Our objective is to construct a 3D cost volume and use 2D CNNs, augmented with channel boost, for cost aggregation. This approach aims to balance efficiency and accuracy, making it more suitable for real-time applications on resource-constrained devices. Our proposed LightStereo, as depicted in Figure 2, uses inverted residual blocks to aggregate disparity from a low-resolution cost volume. In this section, we first describe the design of channel-boosted inverted residual blocks for 2D cost aggregation (Section 3.1). Then, we introduce the Multi-Scale Convolutional Attention (MSCA) module (Section 3.2). Finally, in Section 3.3, we present details of the network architecture of LightStereo.

3.1 Inverted Residual Blocks for 2D Cost Aggregation

Previous efforts, such as MobilenetStereo-2D shamsafar2022mobilestereonet , have incorporated MobileNet mijwil2023mobilenetv1 ; sandler2018mobilenetv2 blocks to decrease computational costs. Despite these efforts, the results have not met expectations, with an EPE of 1.14 still reported on SceneFlow sceneflow . To address this issue, our study introduces the inverted residual block to boost disparity estimation accuracy. As shown in Figure 3 (c), the inverted residual block sandler2018mobilenetv2 is a fundamental component in the design of a lightweight stereo matching network. Given a cost volume of size $H/4\times W/4\times Disp/4$, the key idea is to first boost the number of disparity channels, then apply a depthwise convolution, and finally project the expanded features back to a lower-dimensional space, which significantly enhances the feature representation. As shown in Figure 2, inverted residual blocks are used at resolutions of 1/4, 1/8, and 1/16, with a different number of blocks at each resolution. For every block:

The cost volume $\mathbf{C}$ is first passed through a $1\times 1$ convolution to increase the number of disparity channels:

$$\mathbf{y}=\sigma(\mathbf{W}_{\text{expand}}\ast\mathbf{C}), \tag{1}$$

where $\mathbf{W}_{\text{expand}}$ represents the weights of the expansion convolution, $\ast$ denotes the convolution operation, and $\sigma$ is the ReLU6 activation function. Next, the expanded feature map $\mathbf{y}$ undergoes a $3\times 3$ depthwise convolution, which operates independently on each disparity channel to capture spatial features:

$$\mathbf{z}=\sigma(\mathbf{W}_{\text{depthwise}}\ast\mathbf{y}), \tag{2}$$

where $\mathbf{W}_{\text{depthwise}}$ are the weights of the depthwise convolution. Finally, the result $\mathbf{z}$ is passed through another $1\times 1$ convolution to reduce the number of channels back to the original dimension:

$$\mathbf{out}=\mathbf{W}_{\text{project}}\ast\mathbf{z}, \tag{3}$$

where $\mathbf{W}_{\text{project}}$ represents the weights of the projection convolution. If the input and output dimensions match, a skip connection from the block input $\mathbf{C}$ is added to improve gradient flow and model performance:

$$\mathbf{out}=\mathbf{out}+\mathbf{C}\quad\text{if dimensions match}. \tag{4}$$
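The four steps in Eqs. (1) to (4) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the LightStereo implementation: biases and normalization layers are omitted, the `(channels, height, width)` layout and all function names are chosen for this example, and the weights are random.

```python
import numpy as np

def relu6(x):
    """ReLU6 activation: clamp values to the range [0, 6]."""
    return np.clip(x, 0.0, 6.0)

def pointwise_conv(x, weight):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def depthwise_conv3x3(x, weight):
    """3x3 depthwise convolution with zero padding. x: (C, H, W), weight: (C, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            # Each channel has its own filter tap; no mixing across channels.
            out += weight[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def inverted_residual(cost, w_expand, w_depthwise, w_project):
    y = relu6(pointwise_conv(cost, w_expand))      # Eq. (1): boost the channels
    z = relu6(depthwise_conv3x3(y, w_depthwise))   # Eq. (2): per-channel spatial filter
    out = pointwise_conv(z, w_project)             # Eq. (3): project back, no activation
    if out.shape == cost.shape:                    # Eq. (4): skip connection
        out = out + cost
    return out

# Toy sizes (assumed): 8 disparity channels, expansion factor 4.
rng = np.random.default_rng(0)
channels, height, width, expansion = 8, 6, 10, 4
hidden = channels * expansion
cost = rng.standard_normal((channels, height, width))
w_expand = rng.standard_normal((hidden, channels)) * 0.1
w_depthwise = rng.standard_normal((hidden, 3, 3)) * 0.1
w_project = rng.standard_normal((channels, hidden)) * 0.1

out = inverted_residual(cost, w_expand, w_depthwise, w_project)
print(out.shape)  # (8, 6, 10)
```

In this sketch the only operation that mixes disparity channels is the pair of $1\times 1$ convolutions, while the spatial filtering stays per-channel, which is where the block spends its capacity on the channel dimension.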

The inverted residual block’s structure significantly reduces computational complexity, making it ideal for resource-constrained environments. Our experimental results demonstrate the superiority of the inverted residual block design over a regular CNN block (Figure 3 (a)), the V1 block mijwil2023mobilenetv1 (Figure 3 (b)), and the Vision Transformer (ViT) block liu2023efficientvit (Figure 3 (d)).
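As a rough illustration of the complexity argument, the sketch below counts multiply-accumulate operations (MACs) per output pixel for an inverted residual block and for dense $3\times 3$ convolutions. The channel count of 48 assumes a maximum disparity of 192 at 1/4 resolution, which the text does not state; biases and normalization are ignored, and the function names are illustrative.

```python
def regular_conv_macs(channels, kernel=3):
    """MACs per output pixel of a dense kernel x kernel convolution (C -> C)."""
    return kernel * kernel * channels * channels

def inverted_residual_macs(channels, expansion, kernel=3):
    """MACs per output pixel of expand (1x1) + depthwise (k x k) + project (1x1)."""
    hidden = channels * expansion
    expand = channels * hidden             # 1x1 expansion: C -> tC
    depthwise = kernel * kernel * hidden   # one k x k filter per hidden channel
    project = hidden * channels            # 1x1 projection: tC -> C
    return expand + depthwise + project

channels = 192 // 4   # assumed: maximum disparity 192, cost volume at 1/4 scale
expansion = 4         # expansion factor used by LightStereo-S and LightStereo-M

block = inverted_residual_macs(channels, expansion)
dense_narrow = regular_conv_macs(channels)             # dense conv at 48 channels
dense_wide = regular_conv_macs(channels * expansion)   # dense conv at 192 channels
print(block, dense_narrow, dense_wide)  # 20160 20736 331776
```

Under these assumptions the block works at four times the channel width for about the cost of one dense $3\times 3$ convolution at the original width, and roughly 16 times fewer MACs than a dense convolution at the expanded width.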

Figure 3: Comparison of different blocks for cost aggregation. DW refers to depthwise separable convolution. V1 Block represents the depthwise separable convolution mijwil2023mobilenetv1 . V2 Block represents an inverted residual block sandler2018mobilenetv2 . ViT Block refers to the block used in EfficientViT liu2023efficientvit .

3.2 Multi-Scale Convolutional Attention Module

Inspired by the effectiveness of large-kernel and strip convolutions in image segmentation peng2017large ; hou2020strip ; guo2022segnext , we propose a Multi-Scale Convolutional Attention (MSCA) module that enhances cost aggregation by extracting features from the left image. The bottom-left corner of Figure 2 illustrates its architecture. The module is designed to capture and integrate features at multiple scales to enhance the feature representation for cost aggregation. MSCA incorporates a series of depthwise separable convolutions with varying kernel sizes, specifically $1\times 1$, $7\times 1$, $1\times 7$, $11\times 1$, $1\times 11$, $21\times 1$, and $1\times 21$. These convolutions capture both horizontal and vertical strip-like features, which are crucial for identifying elongated structures within the image. We choose depthwise strip convolutions because they are lightweight: for example, a pair of $7\times 1$ and $1\times 7$ convolutions can stand in for a standard $7\times 7$ convolution. Strip convolutions thus complement grid convolutions and help extract strip-like features from the left image. Using multi-scale image features, MSCA excites the channel dimension of the cost volume, so that semantic information embedded in the images guides the cost aggregation process. The network is designed to stop propagation when it detects disparity discontinuities.
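The saving from strip convolutions can be made concrete with a small count of weights per channel. The sketch below is illustrative only; the helper names are hypothetical, and `effective_kernel` shows that composing a $k\times 1$ and a $1\times k$ filter (with no nonlinearity in between) covers the same $k\times k$ window but is limited to outer-product (rank-1) kernels, so it approximates rather than reproduces a full $k\times k$ filter.

```python
def dense_kernel_weights(k):
    """Weights per channel of a depthwise k x k convolution."""
    return k * k

def strip_pair_weights(k):
    """Weights per channel of a depthwise k x 1 convolution followed by a 1 x k one."""
    return 2 * k

def effective_kernel(column, row):
    """k x k kernel equivalent to applying `column` (k x 1) and then `row` (1 x k)."""
    return [[c * r for r in row] for c in column]

for k in (7, 11, 21):
    print(k, dense_kernel_weights(k), strip_pair_weights(k))
# 7 49 14
# 11 121 22
# 21 441 42

kernel = effective_kernel([1.0] * 7, [1.0] * 7)
print(len(kernel), len(kernel[0]))  # 7 7
```

The gap widens with kernel size: at $k=21$ the strip pair uses 42 weights per channel instead of 441.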

Given a stereo image input of size $H\times W\times 3$, the method obtains three scales of feature maps at 1/4, 1/8, and 1/16 of the original resolution. These multi-scale feature maps are then processed by the MSCA module to further extract horizontal and vertical strip-like features. The outputs of MSCA are concatenated to form a multi-scale feature representation, which is subsequently processed by a $1\times 1$ convolution that acts as a channel mixer. This $1\times 1$ convolution blends the multi-scale features and recalibrates the feature channels, enhancing the network’s ability to focus on relevant information across different scales. Finally, the MSCA output is combined with the aggregated cost by multiplication, enhancing the precision of cost aggregation.

3.3 Network Architecture of LightStereo

Our proposed LightStereo consists of five components: feature extraction, cost computation, cost aggregation, disparity prediction, and loss. In the following, we provide introductions to each module.

Multi-scale Feature Extraction. For a stereo image pair of size $H\times W\times 3$, we use a MobileNetV2 bangunharcana2021coex ; xu2023iterative model pretrained on the ImageNet deng2009imagenet dataset as the backbone. We extract feature maps at four scales, at 1/4, 1/8, 1/16, and 1/32 of the input resolution. Subsequently, upsampling blocks with skip connections restore these feature maps to the 1/4 scale, which is used to construct the cost volume.

Cost Volume. Using the left features $\mathbf{f}_{l,4}$ and right features $\mathbf{f}_{r,4}$ extracted from images $\mathbf{I}_{l}$ and $\mathbf{I}_{r}$, we construct a correlation volume. The correlation cost volume is built by comparing features from $\mathbf{f}_{l,4}$ and $\mathbf{f}_{r,4}$ across different disparity levels. For each disparity $d$ within the range of 0 to $D-1$, the similarity between $\mathbf{f}_{l,4}$ and $\mathbf{f}_{r,4}$ shifted by $d$ pixels is computed. Mathematically, this can be expressed as:

$$C_{cor}(d,h,w)=\frac{1}{C}\sum_{c=1}^{C}f_{l,4}(h,w)\cdot f_{r,4}(h,w-d). \tag{5}$$

For $d=0$, the cost is computed as the average of the element-wise product of feature vectors from $\mathbf{f}_{l,4}$ and $\mathbf{f}_{r,4}$ at the same spatial location, across all channels.
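Eq. (5) can be sketched in NumPy as below. This is an illustrative reading, not the authors' code: the `(C, H, W)` feature layout, the zero fill for columns with no valid counterpart ($w<d$), and the toy data are assumptions. The right features are generated by shifting unit-norm left features by a known disparity, so the correlation should peak at that disparity.

```python
import numpy as np

def correlation_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) correlation volume from (C, H, W) features."""
    _, height, width = feat_left.shape
    volume = np.zeros((max_disp, height, width), dtype=feat_left.dtype)
    for d in range(max_disp):
        # Left pixel (h, w) is compared with right pixel (h, w - d); columns
        # with w < d have no counterpart and stay at zero (an assumption).
        left = feat_left[:, :, d:]
        right = feat_right[:, :, :width - d]
        volume[d, :, d:] = (left * right).mean(axis=0)  # average over channels
    return volume

rng = np.random.default_rng(0)
feat_left = rng.standard_normal((16, 4, 12))
feat_left /= np.linalg.norm(feat_left, axis=0, keepdims=True)  # unit-norm feature vectors
true_disp = 2
feat_right = np.roll(feat_left, -true_disp, axis=2)  # right view: content shifted left

volume = correlation_volume(feat_left, feat_right, max_disp=4)
best = volume.argmax(axis=0)  # per-pixel disparity with the highest correlation
print(volume.shape)  # (4, 4, 12)
```

The output has one channel per disparity candidate, which is why 2D convolutions over this volume treat the disparity axis as the channel dimension.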

Cost Aggregation. We use inverted residual blocks to aggregate the cost volume at 1/4, 1/8, and 1/16 resolutions, and at each resolution we apply multi-scale convolutional attention using the left image, as described in Section 3.1 and Section 3.2. We develop three variants of LightStereo that differ in block count and expansion factor: LightStereo-S for small-scale applications, LightStereo-M for medium-scale tasks, and LightStereo-L, the largest configuration. In LightStereo-S, the inverted residual blocks are configured as (1, 2, 4) with an expansion factor of 4. For LightStereo-M, the blocks are set as (4, 8, 14) with an expansion factor of 4. Finally, LightStereo-L uses blocks (8, 16, 32) with an expansion factor of 8. A further variant of LightStereo-L, reported as the second LightStereo-L entry in our tables, replaces the backbone with EfficientnetV2 tan2021efficientnetv2 .

Disparity Regression. We use disparity regression gcnet ; psmnet2018 to estimate the disparity map. This method predicts disparities from the probability of each disparity $d$, derived from the predicted cost $c_{d}$ through the softmax operation $\sigma(\cdot)$. The predicted disparity $\hat{d}$ is then obtained by summing each disparity $d$ weighted by its probability:

$$\hat{d}=\sum_{d=0}^{D_{\text{max}}}d\times\sigma(c_{d}). \tag{6}$$
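A minimal pure-Python sketch of Eq. (6) for a single pixel is given below. The function names are illustrative, and the softmax is applied directly to the predicted costs as written in the equation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def regress_disparity(costs):
    """Eq. (6): expected disparity under the softmax distribution of the costs."""
    probs = softmax(costs)
    return sum(d * p for d, p in enumerate(probs))

# Two equally strong candidates at d = 1 and d = 2 give a sub-pixel estimate
# halfway between them.
print(regress_disparity([0.0, 5.0, 5.0, 0.0]))
```

Because the estimate is a probability-weighted average rather than an argmax, it is differentiable and can take sub-pixel values; the flip side is that a multi-modal distribution pulls the estimate between the modes.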

Loss. We employ the smooth $L_{1}$ loss to train the proposed LightStereo. The loss is defined as:

$$L(d,\hat{d})=\frac{1}{N}\sum_{i=1}^{N}\text{smooth}_{L_{1}}(d_{i}-\hat{d_{i}}), \tag{7}$$

where $N$ is the number of labeled pixels, $d$ represents the ground-truth disparity, and $\hat{d}$ denotes the predicted disparity.
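Eq. (7) can be sketched as follows. The text does not spell out $\text{smooth}_{L_{1}}$, so this example assumes the common definition with a threshold of 1 ($0.5x^{2}$ for $|x|<1$, $|x|-0.5$ otherwise); the validity mask and all names are likewise assumptions for illustration.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 of a residual: quadratic near zero, linear beyond `beta` (assumed 1)."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def disparity_loss(gt, pred, valid):
    """Eq. (7): mean smooth L1 over the N labeled pixels only."""
    terms = [smooth_l1(g - p) for g, p, v in zip(gt, pred, valid) if v]
    return sum(terms) / len(terms) if terms else 0.0

gt = [10.0, 20.0, 30.0, 0.0]
pred = [10.5, 23.0, 30.0, 7.0]
valid = [True, True, True, False]   # the last pixel has no ground truth
print(disparity_loss(gt, pred, valid))  # 0.875
```

The masked pixel contributes nothing even though its prediction is far from the placeholder value, which matters for sparsely labeled data.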

4 Experiment

4.1 Datasets and Evaluation Metrics

SceneFlow sceneflow is a synthetic stereo collection that provides 35,454 and 4,370 image pairs for training and testing, respectively. The images have a resolution of 960×540, with dense disparity maps as ground truth. The data come in two versions: cleanpass and finalpass. Cleanpass refers to synthetic images generated by clean renderings without post-processing, while finalpass images are produced with photorealistic settings such as motion blur, defocus blur, and noise. For evaluation, we use the widely used end-point error (EPE) as the metric.

KITTI. KITTI 2012 kitti2012 and KITTI 2015 kitti2015 are datasets captured from real-world scenes. KITTI 2012 contains 194 and 195 image pairs for training and testing, respectively. KITTI 2015 provides 200 image pairs each for training and testing. Both KITTI datasets provide sparse ground-truth disparities obtained from a LiDAR system. For evaluation, we calculate EPE and the percentage of pixels with an error larger than 3 pixels in all regions (D1-all). Both KITTI datasets are also used to evaluate cross-domain generalization, for which we report EPE and a >3px metric (i.e., the percentage of points with an absolute error larger than 3 pixels).
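The two metrics described above can be sketched in a few lines. The example follows the definitions given in the text (mean absolute error, and percentage of labeled pixels with an error above 3 pixels); note that the official KITTI 2015 D1 measure additionally requires the error to exceed 5% of the ground-truth disparity, which this sketch omits. The mask handling and function names are assumptions.

```python
def end_point_error(gt, pred, valid):
    """EPE: mean absolute disparity error over pixels that have ground truth."""
    errors = [abs(g - p) for g, p, v in zip(gt, pred, valid) if v]
    return sum(errors) / len(errors)

def outlier_rate(gt, pred, valid, threshold=3.0):
    """Percentage of labeled pixels whose absolute error exceeds `threshold` pixels."""
    errors = [abs(g - p) for g, p, v in zip(gt, pred, valid) if v]
    return 100.0 * sum(e > threshold for e in errors) / len(errors)

gt = [10.0, 20.0, 30.0, 40.0, 0.0]
pred = [10.5, 24.0, 30.0, 41.0, 99.0]
valid = [True, True, True, True, False]   # sparse ground truth: last pixel unlabeled

print(end_point_error(gt, pred, valid))  # 1.375
print(outlier_rate(gt, pred, valid))     # 25.0
```

Only the second pixel (error 4.0) counts as an outlier, and the unlabeled pixel is excluded from both metrics.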

4.2 Implementation Details

We implement our methods in PyTorch and conduct experiments on 8 NVIDIA RTX 3090 GPUs. For the SceneFlow sceneflow dataset, the batch sizes for LightStereo-S, LightStereo-M, LightStereo-L, and the EfficientnetV2-backbone variant of LightStereo-L are set to 24, 12, 8, and 6, respectively. We use the AdamW optimizer with OneCycleLR scheduling, where the maximum learning rate is set to 0.0001 multiplied by the batch size. In the ablation study, the LightStereo models are trained for 50 epochs. For the final model evaluation, LightStereo is trained for 90 epochs. Random cropping is used for data augmentation. For the KITTI datasets, we fine-tune the SceneFlow-pretrained models for 500 epochs on a mixed training set comprising the KITTI 2012 kitti2012 and KITTI 2015 kitti2015 training sets. We employ a batch size of 2 and the AdamW optimizer. OneCycleLR scheduling is used with a maximum learning rate of 0.0002. Data augmentation includes color jitter, random erasing, random scaling, and random cropping.
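The learning-rate rule can be worked through numerically. In the sketch below, the peak rates follow the stated rule (0.0001 times the batch size); the fourth entry is assumed to be the EfficientnetV2-backbone variant of LightStereo-L. The schedule function is a simplified one-cycle curve with cosine annealing whose defaults (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`) mirror those documented for PyTorch's `OneCycleLR`; the settings actually used for training are not given in the text, so these are assumptions.

```python
import math

BASE_LR = 1e-4  # "0.0001 multiplied by the batch size"
BATCH_SIZES = {
    "LightStereo-S": 24,
    "LightStereo-M": 12,
    "LightStereo-L": 8,
    "LightStereo-L (EfficientNetV2 backbone)": 6,  # assumed identity of the fourth entry
}
MAX_LR = {name: BASE_LR * batch for name, batch in BATCH_SIZES.items()}

def one_cycle_lr(step, total_steps, max_lr,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Simplified one-cycle schedule: cosine warm-up to max_lr, then cosine decay."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        progress = step / warmup_steps
        start, end = initial_lr, max_lr
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        start, end = max_lr, min_lr
    # Cosine interpolation from `start` (progress 0) to `end` (progress 1).
    return end + (start - end) * (1.0 + math.cos(math.pi * progress)) / 2.0

for name, lr in MAX_LR.items():
    print(f"{name}: peak lr = {lr:.4f}")

peak = MAX_LR["LightStereo-S"]
print(one_cycle_lr(0, 1000, peak), one_cycle_lr(300, 1000, peak), one_cycle_lr(1000, 1000, peak))
```

With these assumed defaults, the rate starts at 1/25 of the peak, reaches the peak after 30% of the steps, and decays to a negligible value by the end of training.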

Figure 4: Qualitative results on the test set of SceneFlow sceneflow . The first column presents the left image on the first row and the right image on the second row. Columns 2 to 5 display the predicted disparity maps on the first row and the corresponding error maps on the second row.
Table 1: Comparison with the state-of-the-art on SceneFlow sceneflow . The runtime is tested on our RTX 3090 GPU.

| Method | FLOPs (G) | Params (M) | EPE (px) | Runtime (ms) |
| --- | --- | --- | --- | --- |
| DeepPruner-Fast deeppruner2019 | 219.12 | 7.47 | 1.25 | 39 |
| StereoNet stereonet2018 | 85.93 | 0.40 | 1.10 | 20 |
| 2D-MobileStereoNet shamsafar2022mobilestereonet | 128.84 | 2.23 | 1.11 | 73 |
| FADNet++ wang2021fadnet++ | 148.21 | 124.26 | 0.85 | 21 |
| CoEx bangunharcana2021coex | 53.39 | 2.72 | 0.67 | 36 |
| Fast-ACVNet fast-acv | 79.34 | 3.08 | 0.64 | 22 |
| Fast-ACVNet+ fast-acv | 93.08 | 3.20 | 0.59 | 27 |
| AANet xu2020aanet | 152.86 | 2.97 | 0.87 | 93 |
| IINet li2024iinet | 90.16 | 19.54 | 0.54 | 26 |
| LightStereo-S (Ours) | 22.71 | 3.44 | 0.73 | 17 |
| LightStereo-M (Ours) | 36.36 | 7.64 | 0.62 | 23 |
| LightStereo-L (Ours) | 91.85 | 24.29 | 0.59 | 37 |
| LightStereo-L, EfficientnetV2 backbone (Ours) | 159.26 | 45.63 | 0.51 | 54 |

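Table 2: Comparison with the state-of-the-art methods on KITTI benchmarks. The methods are categorized based on whether their runtime exceeds 100ms. Numbers in bold represent the highest values, while numbers underlined indicate the second-highest values. The runtime of the model with is tested on our RTX 3090 GPU.
Columns 3-noc through EPE all are reported on KITTI 2012 kitti2012 ; D1-bg, D1-fg, and D1-all are reported on KITTI 2015 kitti2015 .

| Target | Method | 3-noc | 3-all | 4-noc | 4-all | EPE noc | EPE all | D1-bg | D1-fg | D1-all | Runtime (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy | GANet ganet2019 | 1.19 | 1.60 | 0.91 | 1.23 | 0.4 | 0.5 | 1.48 | 3.46 | 1.81 | 1800 |
| Accuracy | LaC+GANet LaC+GANet | 1.05 | 1.42 | 0.80 | 1.09 | 0.4 | 0.5 | 1.44 | 2.83 | 1.67 | 1800 |
| Accuracy | CFNet cfnet2021 | 1.23 | 1.58 | 0.92 | 1.18 | 0.4 | 0.5 | 1.54 | 3.56 | 1.88 | 180 |
| Accuracy | SegStereo segstereo2018 | 1.68 | 2.03 | 1.25 | 1.52 | 0.5 | 0.6 | 1.88 | 4.07 | 2.25 | 600 |
| Accuracy | SSPCVNet SSPCVNet | 1.47 | 1.90 | 1.08 | 1.41 | 0.5 | 0.6 | 1.75 | 3.89 | 2.11 | 900 |
| Accuracy | EdgeStereo-V2 song2020edgestereo | 1.46 | 1.83 | 1.07 | 1.34 | 0.4 | 0.5 | 1.84 | 3.30 | 2.08 | 320 |
| Accuracy | CSPN CSPN | 1.19 | 1.53 | 0.93 | 1.19 | - | - | 1.51 | 2.88 | 1.74 | 1000 |
| Accuracy | LEAStereo leastereo | 1.13 | 1.45 | 0.83 | 1.08 | 0.5 | 0.5 | 1.40 | 2.91 | 1.65 | 300 |
| Accuracy | CREStereo Crestereo | 1.14 | 1.46 | 0.90 | 1.14 | 0.4 | 0.5 | 1.45 | 2.86 | 1.69 | 410 |
| Accuracy | ACVNet acvnet | 1.13 | 1.47 | 0.86 | 1.12 | 0.4 | 0.5 | 1.37 | 3.07 | 1.65 | 200 |
| Speed | DispNetC sceneflow | 4.11 | 4.65 | 2.77 | 3.20 | 0.9 | 1.0 | 4.32 | 4.41 | 4.34 | 60 |
| Speed | StereoNet stereonet2018 | - | - | - | - | 0.8 | 0.9 | 4.30 | 7.45 | 4.83 | 22 |
| Speed | DeepPrunerFast deeppruner2019 | - | - | - | - | - | - | 2.32 | 3.91 | 2.59 | 50 |
| Speed | AANet xu2020aanet | 1.91 | 2.42 | 1.46 | 1.87 | 0.5 | 0.6 | 1.99 | 5.39 | 2.55 | 62 |
| Speed | DecNet DecNet | - | - | - | - | - | - | 2.07 | 3.87 | 2.37 | 50 |
| Speed | HITNet hitnet | 1.41 | 1.89 | 1.14 | 1.53 | 0.4 | 0.5 | 1.74 | 3.20 | 1.98 | 54 |
| Speed | CoEx bangunharcana2021coex | 1.55 | 1.93 | 1.15 | 1.42 | 0.5 | 0.5 | 1.79 | 3.82 | 2.13 | 33 |
| Speed | Fast-ACVNet fast-acv | 1.68 | 2.13 | 1.23 | 1.56 | 0.5 | 0.6 | 1.82 | 3.93 | 2.17 | 39 |
| Speed | Fast-ACVNet+ fast-acv | 1.45 | 1.85 | 1.06 | 1.36 | 0.5 | 0.5 | 1.70 | 3.53 | 2.01 | 45 |
| Speed | LightStereo-S (Ours) | 1.88 | 2.34 | 1.30 | 1.65 | 0.6 | 0.6 | 2.00 | 3.80 | 2.30 | 17 |
| Speed | LightStereo-M (Ours) | 1.56 | 1.91 | 1.10 | 1.36 | 0.5 | 0.5 | 1.81 | 3.22 | 2.04 | 23 |
| Speed | LightStereo-L (Ours) | 1.55 | 1.87 | 1.10 | 1.33 | 0.5 | 0.5 | 1.78 | 2.64 | 1.93 | 34 |
| Speed | LightStereo-L, EfficientnetV2 backbone (Ours) | 1.34 | 1.62 | 0.96 | 1.17 | 0.5 | 0.5 | 1.60 | 2.92 | 1.82 | 49 |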

4.3 Comparisons with State-of-the-art Methods

As shown in Table 1, we compare the proposed LightStereo models with several state-of-the-art stereo matching approaches on the SceneFlow sceneflow dataset. In terms of runtime, LightStereo-S takes only 17ms, the lowest among the compared methods. Regarding model complexity, LightStereo-S requires only 22.71 GFLOPs, roughly a quarter of StereoNet stereonet2018 (85.93 GFLOPs). This reflects our commitment to maintaining a lightweight model while ensuring competitive performance. In terms of accuracy, the EfficientnetV2-backbone variant of LightStereo-L achieves an EPE of 0.51, lower than Fast-ACVNet+ fast-acv (0.59) and IINet li2024iinet (0.54), although at a higher computational cost (159.26 GFLOPs), while LightStereo-M reaches an EPE of 0.62 at only 36.36 GFLOPs. Overall, our LightStereo framework, particularly the LightStereo-S configuration, presents a compelling solution for real-time stereo matching, offering a favorable trade-off between computational efficiency and depth estimation accuracy. This makes it highly suitable for applications such as autonomous navigation and augmented reality, where rapid and reliable depth perception is essential. Figure 4 presents the visualization results of the four proposed models on the SceneFlow sceneflow dataset. Additionally, we evaluated our proposed models on the KITTI 2012 and 2015 benchmarks. As illustrated in Table 2, LightStereo-S has the shortest runtime of all listed methods (17ms). Notably, the EfficientnetV2-backbone variant of LightStereo-L surpasses all other lightweight state-of-the-art stereo matching networks on the 3-pixel and 4-pixel outlier metrics of the KITTI 2012 dataset. It also achieves the best D1-bg and D1-all among lightweight stereo matching methods on the KITTI 2015 dataset.

4.4 Ablation Study

Table 3: Ablation study of backbone on SceneFlow sceneflow.
Backbone Type EPE (px) FLOPs (G) Params (M) Time (ms)
MobilenetV2 sandler2018mobilenetv2 CNN 0.7144 35.82 7.54 22.93
MobilenetV3 mobilenetv3 CNN 0.7292 35.72 9.16 25.02
StarNet StarNet CNN 0.7247 40.21 8.98 26.63
EfficientnetV2 tan2021efficientnetv2 CNN 0.6130 103.14 28.87 46.83
RepVIT wang2023repvit Transformer 0.6823 50.45 9.56 28.65

Backbone. This paper aims to design a lightweight stereo-matching model, so we evaluated several classic lightweight networks as backbones. As illustrated in Table 3, MobilenetV2 sandler2018mobilenetv2 offers a balanced result, with an EPE of 0.7144, 35.82 GFLOPs, 7.54M parameters, and an inference time of 22.93 ms. MobilenetV3 mobilenetv3 has marginally fewer FLOPs (35.72G) but a higher EPE (0.7292), more parameters (9.16M), and a longer inference time (25.02 ms). StarNet StarNet reaches an EPE of 0.7247 with more FLOPs (40.21G) and a longer inference time (26.63 ms). EfficientnetV2 tan2021efficientnetv2 achieves the lowest EPE of 0.6130, but at a much higher cost of 103.14 GFLOPs, 28.87M parameters, and an inference time of 46.83 ms. RepVIT wang2023repvit, a transformer-based model, obtains an EPE of 0.6823 with 50.45 GFLOPs, 9.56M parameters, and an inference time of 28.65 ms. We therefore choose MobilenetV2 for its balance between accuracy and computational cost.

Figure 5: Qualitative results on the test set of SceneFlow sceneflow. The first column presents the left image on the first row and the right image on the second row. Columns 2 to 5 display the predicted disparity maps on the first row and the corresponding error maps on the second row.

Table 4: Ablation study on the design of the block on SceneFlow sceneflow. Regular represents regular convolution. V1 Block represents the depthwise separable convolution mijwil2023mobilenetv1. V2 Block represents an inverted residual block sandler2018mobilenetv2. DW refers to depthwise separable convolution. ViT Block refers to the block used in EfficientViT liu2023efficientvit. The underlined backbone is the one we ultimately used.
Conv. Type Kernel Block EPE (px) FLOPs (G) Params (M) Time (ms)
Regular 3×3 (4 8 16) 0.7652 36.27 8.04 22.75
Regular 5×5 (4 8 16) 0.7979 71.16 18.62 17.61
Regular 7×7 (4 8 16) 0.8190 123.51 34.49 20.06
Regular 11×11 (4 8 16) 0.8672 280.53 82.10 34.22
V1 Block DW 3×3 (30 60 120) 0.7801 34.86 7.57 54.21
V2 Block DW 3×3 (4 8 16) 0.7144 35.82 7.54 22.93
ViT Block - (3 6 9) 0.7149 34.48 6.53 51.14

Conv. Type. As illustrated in Table 4, our experiments indicate that the disparity dimension matters more for cost aggregation than spatial expansion along the height and width dimensions. With regular convolutions, increasing the kernel size from 3×3 to 11×11 raises the EPE from 0.7652 to 0.8672 while FLOPs grow from 36.27G to 280.53G. Enlarging the spatial receptive field therefore does not improve accuracy, and effort is better spent on the disparity dimension of the cost volume. The V1 block with a 3×3 kernel reaches an EPE of 0.7801 with slightly fewer FLOPs (34.86G) but a much longer inference time (54.21 ms). In contrast, the V2 block provides the best balance, with the lowest EPE of 0.7144, 35.82 GFLOPs, and an inference time of 22.93 ms. We attribute this to the structure of the V2 block, which incorporates expansion convolutions in the disparity dimension and thus aggregates the cost volume more effectively. The ViT block attains a similar EPE (0.7149) with fewer parameters (6.53M) but a longer inference time (51.14 ms). We therefore adopt the V2 block. Figure 5 shows a visual comparison of the different convolution types. For the occluded area in the lower left corner of the left image, cost aggregation based on the V2 block produces better results than the other three convolution types.
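To make the contrast between spatial and channel capacity concrete, here is a rough per-layer weight count. It is a sketch under stated assumptions, not the paper's parameter accounting: the helper names and the channel width `D = 48` are hypothetical, biases and normalization layers are ignored, and the totals in Table 4 cover the whole network rather than a single layer. A dense k×k convolution grows quadratically with the kernel size, whereas a MobileNetV2-style inverted residual block spends most of its weights on 1×1 convolutions that widen the channel (disparity) dimension.

```python
def regular_conv_params(channels: int, kernel: int) -> int:
    """Weights of a dense kernel x kernel conv mapping channels -> channels (no bias)."""
    return channels * channels * kernel * kernel


def inverted_residual_params(channels: int, expansion: int, kernel: int = 3) -> int:
    """Weights of a MobileNetV2-style block: 1x1 expand, depthwise kxk, 1x1 project."""
    hidden = channels * expansion
    expand = channels * hidden            # 1x1 conv widening the channel dimension
    depthwise = hidden * kernel * kernel  # one kxk filter per hidden channel
    project = hidden * channels           # 1x1 conv back to the input width
    return expand + depthwise + project


D = 48  # hypothetical number of disparity channels in the 3D cost volume

for k in (3, 5, 7, 11):
    print(f"regular {k}x{k}: {regular_conv_params(D, k)} weights")
for e in (2, 4, 8):
    print(f"inverted residual, expansion {e}: {inverted_residual_params(D, e)} weights")
```

Under these assumptions, an inverted residual block with expansion 4 has about as many weights as a single regular 3×3 convolution, while an 11×11 regular convolution needs more than ten times as many.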

Block Structure Analysis. As illustrated in Table 5, we explore the impact of different block structures while keeping the expansion factor fixed at 4. Configurations 'e' to 'h' use blocks (1, 2, 4), (2, 4, 8), (4, 8, 16), and (8, 16, 32), respectively. Larger blocks lead to better accuracy: configuration 'e' with the smallest block (1, 2, 4) has the highest EPE of 0.8317, while configuration 'h' with the largest block (8, 16, 32) has the lowest EPE of 0.6973. This accuracy comes at a computational cost, as FLOPs rise from 22.17G to 54.02G and parameters from 3.34M to 13.14M.

Expansion Factor Analysis. As illustrated in Table 5, we further analyze the effect of varying the expansion factor while keeping the block structure constant. As the expansion factor increases, the EPE decreases consistently: configuration 'a' with an expansion factor of 2 has an EPE of 0.7557, while configuration 'd' with an expansion factor of 16 achieves the lowest EPE of 0.6650. This further supports the importance of the disparity dimension in cost aggregation, since widening it yields more accurate disparity estimates. However, the gain comes with a substantial increase in computation and parameters: FLOPs grow from 26.23G to 93.34G, parameters from 4.81M to 23.90M, and the inference time from 22.44 ms to 36.67 ms.
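The trade-off described above can be quantified directly from rows 'a' to 'd' of Table 5. The short sketch below computes the EPE reduction obtained per additional GFLOP between consecutive expansion factors. The helper name `marginal_gains` is introduced here for illustration only, and the numbers are copied from the table.

```python
# (expansion factor, EPE in px, GFLOPs, time in ms) for configurations a-d of Table 5
rows = [
    (2, 0.7557, 26.23, 22.44),
    (4, 0.7144, 35.82, 22.93),
    (8, 0.6853, 54.99, 23.59),
    (16, 0.6650, 93.34, 36.67),
]


def marginal_gains(rows):
    """EPE reduction per additional GFLOP between consecutive configurations."""
    gains = []
    for (e0, epe0, f0, _), (e1, epe1, f1, _) in zip(rows, rows[1:]):
        gains.append((e0, e1, (epe0 - epe1) / (f1 - f0)))
    return gains


for e0, e1, gain in marginal_gains(rows):
    print(f"expansion {e0} -> {e1}: {gain:.5f} px EPE reduction per extra GFLOP")
```

Each doubling of the expansion factor still lowers the EPE, but the reduction per extra GFLOP shrinks at every step, so the larger factors buy accuracy at a steadily higher price.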

Table 5: Ablation study on the design of the block on SceneFlow sceneflow. Expansion refers to the expansion factor. SE represents Squeeze-and-Excitation hu2018squeeze. MSCA represents Multi-Scale Convolutional Attention.
Config Block Expansion SE MSCA EPE (px) FLOPs (G) Params (M) Time (ms)
a (4 8 16) (2 2 2) 0.7557 26.23 4.81 22.44
b (4 8 16) (4 4 4) 0.7144 35.82 7.54 22.93
c (4 8 16) (8 8 8) 0.6853 54.99 12.99 23.59
d (4 8 16) (16 16 16) 0.6650 93.34 23.90 36.67
e (1 2 4) (4 4 4) 0.8317 22.17 3.34 16.65
f (2 4 8) (4 4 4) 0.7464 26.72 4.74 20.04
g (4 8 16) (4 4 4) 0.7144 35.82 7.54 22.93
h (8 16 32) (4 4 4) 0.6973 54.02 13.14 31.09
i (4 8 16) (4 4 4) 0.7144 35.82 7.54 22.93
j (4 8 16) (4 4 4) 0.7036 35.90 12.76 30.14
k (4 8 16) (4 4 4) 0.6809 36.36 7.64 23.14
l (4 8 16) (4 4 4) 0.6810 36.44 12.86 30.82
m (1 2 4) (4 4 4) 0.7899 22.71 3.44 17.59
n (4 8 16) (4 4 4) 0.6809 36.36 7.64 23.14
o (8 16 32) (8 8 8) 0.6382 91.85 24.29 37.55

MSCA Module Analysis. As illustrated in Table 5, the baseline configuration 'i' with blocks (4, 8, 16) achieves an EPE of 0.7144. Incorporating MSCA (configuration 'k') reduces the EPE to 0.6809 with only small increases in FLOPs (35.82G to 36.36G) and parameters (7.54M to 7.64M). MSCA thus provides the largest accuracy improvement with minimal impact on computational efficiency. We also explored the SE module hu2018squeeze. Comparing configurations 'i' and 'j' shows that SE reduces the EPE from 0.7144 to 0.7036, but it increases parameters from 7.54M to 12.76M and the inference time from 22.93 ms to 30.14 ms, with a slight increase in FLOPs.

Table 6: Runtime breakdown (ms).
Module Feature Extraction Cost Computation Cost Aggregation Disparity Regression Total Time
LightStereo-S 10.39 1.98 3.98 1.48 17.83
LightStereo-M 10.39 1.98 9.59 1.49 23.45
LightStereo-L 10.39 1.98 23.64 1.49 37.50
LightStereo-L* 27.17 1.98 23.64 1.49 54.28

4.5 Runtime Analysis

As shown in Table 6, each component of LightStereo, namely feature extraction, cost computation, cost aggregation, and disparity regression, contributes to the overall runtime. As the number of blocks increases from LightStereo-S to LightStereo-M and LightStereo-L, the time spent on cost aggregation grows from 3.98 ms to 9.59 ms and 23.64 ms, while the other stages remain essentially unchanged. For LightStereo-L*, feature extraction takes longer (27.17 ms) because it adopts the more complex EfficientNetV2 backbone. These breakdowns show where the runtime is spent and support the suitability of LightStereo for real-time stereo matching.
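The breakdown can also be read as the share of total time spent in each stage. The sketch below recomputes the totals and per-stage percentages from the Table 6 values. The helper `stage_shares` is a name introduced here, and the fourth row is taken to be LightStereo-L*, as its longer feature-extraction time suggests.

```python
stages = ("feature extraction", "cost computation",
          "cost aggregation", "disparity regression")

# Per-stage runtimes in milliseconds, copied from Table 6.
breakdown_ms = {
    "LightStereo-S": (10.39, 1.98, 3.98, 1.48),
    "LightStereo-M": (10.39, 1.98, 9.59, 1.49),
    "LightStereo-L": (10.39, 1.98, 23.64, 1.49),
    "LightStereo-L*": (27.17, 1.98, 23.64, 1.49),  # assumed label for the last row
}


def stage_shares(times):
    """Return (total time in ms, per-stage share of the total in percent)."""
    total = sum(times)
    return total, [100.0 * t / total for t in times]


for model, times in breakdown_ms.items():
    total, shares = stage_shares(times)
    parts = ", ".join(f"{name} {share:.1f}%" for name, share in zip(stages, shares))
    print(f"{model}: total {total:.2f} ms ({parts})")
```

Seen this way, feature extraction dominates the runtime of LightStereo-S, while cost aggregation becomes the largest stage for LightStereo-L.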

5 Conclusion

In conclusion, this paper presents LightStereo, a lightweight 2D encoder-decoder aggregation network for precise and fast disparity estimation. While similar approaches have been explored previously, our contribution lies in a targeted emphasis on the disparity channel dimension of the 3D cost volume, which encapsulates the distribution of matching costs. Our exploration has produced several strategies that enhance the capacity of this dimension, ensuring both precision and efficiency in disparity estimation. LightStereo thus offers a compelling solution for accelerating stereo matching while maintaining high accuracy.

Acknowledgements. This work was supported by the National Natural Science Foundation of China under Grant 62373356 and the Open Projects Program of the State Key Laboratory of Multimodal Artificial Intelligence Systems.

References

  • [1] Antyanta Bangunharcana, Jae Won Cho, Seokju Lee, In So Kweon, Kyung-Soo Kim, and Soohyun Kim. Correlate-and-excite: Real-time stereo matching via guided cost volume excitation. In IROS, 2021.
  • [2] Michael Bleyer, Christoph Rhemann, and Carsten Rother. Patchmatch stereo-stereo matching with slanted support windows. In BMVC, 2011.
  • [3] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. TPAMI, 2001.
  • [4] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In CVPR, 2018.
  • [5] Xinjing Cheng, Peng Wang, and Ruigang Yang. Learning depth with convolutional spatial propagation network. TPAMI, 2019.
  • [6] Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Dai, Xiaojun Chang, Hongdong Li, Tom Drummond, and Zongyuan Ge. Hierarchical neural architecture search for deep stereo matching. In NeurIPS, 2020.
  • [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [8] Shivam Duggal, Shenlong Wang, Wei-Chiu Ma, Rui Hu, and Raquel Urtasun. Deeppruner: Learning efficient stereo matching via differentiable patchmatch. In ICCV, 2019.
  • [9] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.
  • [10] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In CVPR, 2020.
  • [11] Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethinking convolutional attention design for semantic segmentation. NeurIPS, 2022.
  • [12] Xianda Guo, Juntao Lu, Chenming Zhang, Yiqi Wang, Yiqun Duan, Tian Yang, Zheng Zhu, and Long Chen. Openstereo: A comprehensive benchmark for stereo matching and strong baseline. arXiv preprint arXiv:2312.00343, 2023.
  • [13] Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-wise correlation stereo network. In CVPR, 2019.
  • [14] Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In CVPR, 2005.
  • [15] Asmaa Hosni, Michael Bleyer, Margrit Gelautz, and Christoph Rhemann. Local stereo matching using geodesic support weights. In ICIP, 2009.
  • [16] Asmaa Hosni, Christoph Rhemann, Michael Bleyer, Carsten Rother, and Margrit Gelautz. Fast cost-volume filtering for visual correspondence and beyond. TPAMI, 2012.
  • [17] Qibin Hou, Li Zhang, Ming-Ming Cheng, and Jiashi Feng. Strip pooling: Rethinking spatial pooling for scene parsing. In CVPR, 2020.
  • [18] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In ICCV, 2019.
  • [19] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
  • [20] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In ICCV, 2017.
  • [21] Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, and Shahram Izadi. Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction. In ECCV, 2018.
  • [22] Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, and Shahram Izadi. Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction. In ECCV, 2018.
  • [23] Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Ziwei Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng Liu. Practical stereo matching via cascaded recurrent network with adaptive correlation. In CVPR, 2022.
  • [24] Ximeng Li, Chen Zhang, Wanjuan Su, and Wenbing Tao. Iinet: Implicit intra-inter information fusion for real-time stereo matching. In AAAI, 2024.
  • [25] Lahav Lipson, Zachary Teed, and Jia Deng. Raft-stereo: Multilevel recurrent field transforms for stereo matching. In 3DV, 2021.
  • [26] Biyang Liu, Huimin Yu, and Yangqi Long. Local similarity pattern and cost self-reassembling for deep stereo matching networks. In AAAI, 2022.
  • [27] Biyang Liu, Huimin Yu, and Yangqi Long. Local similarity pattern and cost self-reassembling for deep stereo matching networks. In AAAI, 2022.
  • [28] Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, and Yixuan Yuan. Efficientvit: Memory efficient vision transformer with cascaded group attention. In CVPR, 2023.
  • [29] Xu Ma, Xiyang Dai, Yue Bai, Yizhou Wang, and Yun Fu. Rewrite the stars. CVPR, 2024.
  • [30] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
  • [31] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
  • [32] Maad M Mijwil, Ruchi Doshi, Kamal Kant Hiran, Omega John Unogwu, and Indu Bala. Mobilenetv1-based deep learning model for accurate brain tumor classification. Mesopotamian Journal of Computer Science, 2023.
  • [33] Guang-Yu Nie, Ming-Ming Cheng, Yun Liu, Zhengfa Liang, Deng-Ping Fan, Yue Liu, and Yongtian Wang. Multi-level context ultra-aggregation for stereo matching. In CVPR, 2019.
  • [34] Yuichi Ohta and Takeo Kanade. Stereo by intra-and inter-scanline search using dynamic programming. TPAMI, 1985.
  • [35] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In CVPR, 2017.
  • [36] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • [37] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 2002.
  • [38] Faranak Shamsafar, Samuel Woerz, Rafia Rahim, and Andreas Zell. Mobilestereonet: Towards lightweight deep networks for stereo matching. In WACV, 2022.
  • [39] Zhelun Shen, Yuchao Dai, and Zhibo Rao. Cfnet: Cascade and fused cost volume for robust stereo matching. arXiv preprint arXiv:2104.04314, 2021.
  • [40] Xiao Song, Guorun Yang, Xinge Zhu, Hui Zhou, Zhe Wang, and Jianping Shi. Adastereo: A simple and efficient approach for adaptive stereo matching. In CVPR, 2021.
  • [41] Xiao Song, Xu Zhao, Liangji Fang, Hanwen Hu, and Yizhou Yu. Edgestereo: An effective multi-task learning network for stereo matching and edge detection. IJCV, 2020.
  • [42] Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In ICML, 2021.
  • [43] Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching. In CVPR, 2021.
  • [44] Demetri Terzopoulos. Regularization of inverse visual problems involving discontinuities. TPAMI, 1986.
  • [45] Ao Wang, Hui Chen, Zijia Lin, Hengjun Pu, and Guiguang Ding. Repvit: Revisiting mobile cnn from vit perspective. CVPR, 2024.
  • [46] Hengli Wang, Rui Fan, Peide Cai, and Ming Liu. Pvstereo: Pyramid voting module for end-to-end self-supervised stereo matching. ICRA, 2021.
  • [47] Qiang Wang, Shaohuai Shi, Shizhen Zheng, Kaiyong Zhao, and Xiaowen Chu. FADNet: A fast and accurate network for disparity estimation. In ICRA, 2020.
  • [48] Qiang Wang, Shaohuai Shi, Shizhen Zheng, Kaiyong Zhao, and Xiaowen Chu. Fadnet++: Real-time and accurate disparity estimation with configurable networks. arXiv preprint arXiv:2110.02582, 2021.
  • [49] Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang. Selective-stereo: Adaptive frequency information selection for stereo matching. 2024.
  • [50] Yan Wang, Zihang Lai, Gao Huang, Brian H. Wang, Laurens van der Maaten, Mark Campbell, and Kilian Q. Weinberger. Anytime stereo image depth estimation on mobile devices. In ICRA, 2019.
  • [51] Zhenyao Wu, Xinyi Wu, Xiaoping Zhang, Song Wang, and Lili Ju. Semantic stereo matching with pyramid cost volumes. In ICCV, 2019.
  • [52] Bin Xu, Yuhua Xu, Xiaoli Yang, Wei Jia, and Yulan Guo. Bilateral grid learning for stereo matching networks. In CVPR, 2021.
  • [53] Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Attention concatenation volume for accurate and efficient stereo matching. In CVPR, 2022.
  • [54] Gangwei Xu, Xianqi Wang, Xiaohuan Ding, and Xin Yang. Iterative geometry encoding volume for stereo matching. In CVPR, 2023.
  • [55] Gangwei Xu, Yun Wang, Junda Cheng, Jinhui Tang, and Xin Yang. Accurate and efficient stereo matching via attention concatenation volume. TPAMI, 2023.
  • [56] Gangwei Xu, Huan Zhou, and Xin Yang. Cgi-stereo: Accurate and real-time stereo matching via context and geometry interaction. arXiv preprint arXiv:2301.02789, 2023.
  • [57] Haofei Xu and Juyong Zhang. Aanet: Adaptive aggregation network for efficient stereo matching. In CVPR, 2020.
  • [58] Peng Xu, Zhiyu Xiang, Chengyu Qiao, Jingyun Fu, and Tianyu Pu. Adaptive multi-modal cross-entropy loss for stereo matching. 2024.
  • [59] Koichiro Yamaguchi, David McAllester, and Raquel Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In ECCV, 2014.
  • [60] Guorun Yang, Hengshuang Zhao, Jianping Shi, Zhidong Deng, and Jiaya Jia. Segstereo: Exploiting semantic information for disparity estimation. In ECCV, 2018.
  • [61] Chengtang Yao, Yunde Jia, Huijun Di, Pengxiang Li, and Yuwei Wu. A decomposition model for stereo matching. In CVPR, 2021.
  • [62] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip H.S. Torr. Ga-net: Guided aggregation net for end-to-end stereo matching. In CVPR, 2019.
  • [63] Feihu Zhang, Xiaojuan Qi, Ruigang Yang, Victor Prisacariu, Benjamin Wah, and Philip Torr. Domain-invariant stereo matching networks. In ECCV, 2020.