FormerStereo

Learning Representations from Foundation Models for Domain Generalized Stereo Matching


Yongjian Zhang1     Longguang Wang2    Kunhong Li1    Yun Wang3    Yulan Guo1†   
1SYSU              2AUAF              3CUHK            
† corresponding author       
ECCV 2024

FormerStereo is trained only on SceneFlow (35k stereo pairs) and enhances the generalization performance of most stereo baselines, such as PSMNet and RAFT-Stereo. Our proposed FormerStereo is equipped with the following features:

  • adapts to a wide range of foundation models: DINOv2, DAM, and SAM.
  • adapts to a wide range of stereo aggregation baselines: PSMNet, GwcNet, CFNet, and RAFT-Stereo.
  • achieves state-of-the-art out-of-domain generalization on realistic datasets.
We have also updated our framework with a better adaptor design and will release the improved baseline in the near future.
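
As a quick illustration of what a frozen foundation model looks like in practice, the sketch below loads a DINOv2 backbone via torch.hub and extracts dense patch tokens from it. This is a minimal sketch assuming DINOv2's public torch.hub interface (model name `dinov2_vits14`, `forward_features` returning `x_norm_patchtokens`); it is not our released code.

```python
import torch

# Load a DINOv2 ViT-S/14 backbone (one of the supported foundation
# models) and freeze it: only the downstream modules are trained.
backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def extract_tokens(image):
    """image: (B, 3, H, W), with H and W divisible by the 14-px patch size.
    Returns coarse-grained patch tokens of shape (B, H/14 * W/14, C)."""
    out = backbone.forward_features(image)
    return out['x_norm_patchtokens']

tokens = extract_tokens(torch.randn(1, 3, 224, 448))
print(tokens.shape)  # torch.Size([1, 512, 384]) for ViT-S/14
```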

Abstract

State-of-the-art stereo matching networks trained on in-domain data often underperform on cross-domain scenes. Intuitively, leveraging the zero-shot capacity of a foundation model can alleviate the cross-domain generalization problem. The main challenge of incorporating a foundation model into the stereo matching pipeline lies in the absence of an effective forward process from single-view coarse-grained tokens to cross-view fine-grained cost representations. In this paper, we propose FormerStereo, a general framework that integrates Vision Transformer (ViT) based foundation models into the stereo matching pipeline. Using this framework, we transfer all-purpose features to matching-specific ones. Specifically, we propose a reconstruction-constrained decoder to retrieve fine-grained representations from coarse-grained ViT tokens. To maintain cross-view consistent representations, we propose a cosine-constrained concatenation cost (C4) space to construct cost volumes. We integrate FormerStereo with state-of-the-art (SOTA) stereo matching networks and evaluate its effectiveness on multiple benchmark datasets. Experiments show that the FormerStereo framework effectively improves the zero-shot performance of existing stereo matching networks on unseen domains and achieves SOTA performance.
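
To make the C4 idea concrete, here is a minimal PyTorch sketch of a cosine-constrained concatenation cost volume: both feature maps are L2-normalized along channels before a PSMNet-style concatenation volume is built, so the inner product between the two halves of any cost vector equals their cosine similarity. Function and variable names are ours, and the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def build_c4_volume(feat_left, feat_right, max_disp):
    """Sketch of a cosine-constrained concatenation cost (C4) volume.

    feat_left, feat_right: (B, C, H, W) fine-grained features.
    Returns a (B, 2C, max_disp, H, W) volume: for each candidate
    disparity d, the left feature at x is stacked with the right
    feature at x - d.
    """
    # L2-normalize channels so inner products are cosine similarities.
    fl = F.normalize(feat_left, dim=1)
    fr = F.normalize(feat_right, dim=1)

    B, C, H, W = fl.shape
    volume = fl.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = fl
            volume[:, C:, d] = fr
        else:
            volume[:, :C, d, :, d:] = fl[..., d:]
            volume[:, C:, d, :, d:] = fr[..., :-d]
    return volume
```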

Data Coverage

Our FormerStereo is trained only on SceneFlow (35k stereo pairs) and quantitatively evaluated on the training subsets of four popular realistic datasets: KITTI 2015, KITTI 2012, Middlebury, and ETH3D. We also qualitatively validate the improvement on DrivingStereo and Oxford RobotCar.

Zero-shot Generalization Estimation

Methods integrated with FormerStereo outperform the previous best model, HVT.

(Figure: zero-shot generalization comparison)

Framework

The framework of FormerStereo is shown below. The stereo images are first fed into a frozen ViT backbone to extract features. These ViT features are then transformed into fine-grained representations by the feature transformation module. We use these fine-grained representations to construct the C4 space, which is subsequently converted to disparity by any cost aggregation algorithm.

(Figure: FormerStereo pipeline)
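
Putting the steps together, the sketch below mirrors the pipeline above: frozen ViT, feature transformation, C4 volume (reusing build_c4_volume from the sketch in the Abstract), and an off-the-shelf aggregator. All module names are placeholders rather than our released implementation, and the 1/4-resolution disparity range is an assumption.

```python
import torch
import torch.nn as nn

class FormerStereoSketch(nn.Module):
    """Minimal sketch of the FormerStereo forward pass."""

    def __init__(self, backbone, transform, aggregator, max_disp=192):
        super().__init__()
        self.backbone = backbone      # frozen ViT returning dense tokens
        self.transform = transform    # coarse tokens -> fine-grained features
        self.aggregator = aggregator  # any baseline aggregator, e.g. PSMNet's 3D CNN
        self.max_disp = max_disp

    def forward(self, left, right):
        with torch.no_grad():         # the foundation model stays frozen
            tok_l = self.backbone(left)
            tok_r = self.backbone(right)
        feat_l = self.transform(tok_l)
        feat_r = self.transform(tok_r)
        # Assumes features at 1/4 input resolution, as in PSMNet-style baselines.
        volume = build_c4_volume(feat_l, feat_r, self.max_disp // 4)
        return self.aggregator(volume)  # cost aggregation -> disparity map
```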

Citation

@inproceedings{formerstereo,
  title={Learning Representations from Foundation Models for Domain Generalized Stereo Matching},
  author={Zhang, Yongjian and Wang, Longguang and Li, Kunhong and Wang, Yun and Guo, Yulan},
  booktitle={ECCV},
  year={2024}
}