FormerStereo is trained only on SceneFlow (35k stereo pairs), yet it enhances the cross-domain generalization of most stereo baselines such as PSMNet and RAFT-Stereo. Our proposed FormerStereo is equipped with the features described below.
State-of-the-art stereo matching networks trained on in-domain data often underperform on cross-domain scenes. Intuitively, leveraging the zero-shot capacity of a foundation model can alleviate the cross-domain generalization problem. The main challenge of incorporating a foundation model into the stereo matching pipeline lies in the absence of an effective forward process from single-view coarse-grained tokens to cross-view fine-grained cost representations. In this paper, we propose FormerStereo, a general framework that integrates the Vision Transformer (ViT) based foundation model into the stereo matching pipeline. Using this framework, we transform the all-purpose features into matching-specific ones. Specifically, we propose a reconstruction-constrained decoder to retrieve fine-grained representations from coarse-grained ViT tokens. To maintain cross-view consistent representations, we propose a cosine-constrained concatenation cost (C4) space to construct cost volumes. We integrate FormerStereo with state-of-the-art (SOTA) stereo matching networks and evaluate its effectiveness on multiple benchmark datasets. Experiments show that the FormerStereo framework effectively improves the zero-shot performance of existing stereo matching networks on unseen domains and achieves SOTA performance.
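The C4 space is only described at a high level here. As a rough illustration, below is a minimal PyTorch sketch of one plausible construction, where the cosine constraint is realized by L2-normalizing the features before building a standard concatenation cost volume (the name `build_c4_volume` and the exact tensor layout are assumptions of this sketch, not the released code).

```python
import torch
import torch.nn.functional as F

def build_c4_volume(left_feat, right_feat, max_disp):
    """Sketch of a cosine-constrained concatenation cost volume.

    left_feat / right_feat: [B, C, H, W] fine-grained feature maps.
    The cosine constraint is approximated here by L2-normalizing each
    feature vector along the channel dimension before concatenation.
    Returns a volume of shape [B, 2C, max_disp, H, W].
    """
    left_feat = F.normalize(left_feat, dim=1)   # unit norm per pixel
    right_feat = F.normalize(right_feat, dim=1)
    B, C, H, W = left_feat.shape
    volume = left_feat.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = left_feat
            volume[:, C:, d] = right_feat
        else:
            # match left pixel x against right pixel x - d
            volume[:, :C, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, C:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume
```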
Our FormerStereo is trained only on SceneFlow with 35k stereo pairs and quantitatively evaluated on the training subsets of four popular real-world datasets: KITTI 2015, KITTI 2012, Middlebury, and ETH3D. We also qualitatively validate the improvement on DrivingStereo and Oxford RobotCar.
FormerStereo-integrated methods outperform the previous best model, HVT.
The framework of FormerStereo is shown below. The stereo images are first fed into a frozen ViT backbone to extract features. These ViT features are then transformed into fine-grained representations by the feature transformation module. We use these fine-grained representations to construct the C4 space, which any cost aggregation algorithm can then convert to a disparity map.
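To make the dataflow concrete, here is a hypothetical PyTorch skeleton of this pipeline. `vit_backbone`, `decoder`, and `aggregator` are placeholders for any frozen ViT feature extractor, the feature transformation module, and a cost aggregation network (e.g., from PSMNet or RAFT-Stereo); `build_c4_volume` refers to the sketch above. This is an illustration of the described wiring, not the released implementation.

```python
import torch
import torch.nn as nn

class FormerStereoSketch(nn.Module):
    """Hypothetical end-to-end sketch of the FormerStereo pipeline."""

    def __init__(self, vit_backbone, decoder, aggregator, max_disp=192):
        super().__init__()
        self.vit = vit_backbone
        for p in self.vit.parameters():  # the foundation model stays frozen
            p.requires_grad = False
        self.decoder = decoder           # reconstruction-constrained decoder
        self.aggregator = aggregator     # any cost aggregation network
        self.max_disp = max_disp

    def forward(self, left_img, right_img):
        with torch.no_grad():            # frozen single-view coarse tokens
            tokens_l = self.vit(left_img)
            tokens_r = self.vit(right_img)
        feat_l = self.decoder(tokens_l)  # fine-grained representations
        feat_r = self.decoder(tokens_r)
        volume = build_c4_volume(feat_l, feat_r, self.max_disp)
        return self.aggregator(volume)   # disparity via cost aggregation
```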
@inproceedings{formerstereo,
  title={Learning Representations from Foundation Models for Domain Generalized Stereo Matching},
  author={Zhang, Yongjian and Wang, Longguang and Li, Kunhong and Wang, Yun and Guo, Yulan},
  booktitle={ECCV},
  year={2024}
}