FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation

Figure 1. Precise and robust 6DoF pose estimation. Correspondence Estimation + Solver methods (here LoFTR, RANSAC) produce precise outputs for moderate rotations, but are not robust to large rotations (left), and cannot produce translation scale. Learning-based methods (here LoFTR with 8-Point ViT head) produce scale (right) and are more robust, but lack precision (left). FAR leverages both for precise and robust prediction, including scale.

Abstract

Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely, methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale, but at the expense of reduced precision. We show how to combine the best of both methods; our approach yields results that are both precise and robust, while also accurately inferring translation scales. At the heart of our model lies a Transformer that (1) learns to balance between solved and learned pose estimations, and (2) provides a prior to guide a solver. A comprehensive analysis supports our design choices and demonstrates that our method adapts flexibly to various feature extractors and correspondence estimators, showing state-of-the-art performance in 6DoF pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-Free Relocalization.

Approach

Figure 3. Overview. Given dense features and correspondences, FAR produces camera poses (square boxes □) using a Transformer (round box ▢) and a classical solver (round box ▢). In the first round, the solver produces a pose T_s. FAR's pose Transformer takes a weighted average of T_s and its own prediction T_t, with weight w, to yield the round 1 pose T₁. T₁ then serves as a prior for the classical solver, which produces an updated pose T_u. T_u is combined with an additional estimate of T_t and weight w to produce the final result T. With few correspondences, T₁ helps the solver output, and the network learns to weigh Transformer predictions more heavily; with many correspondences, solver output is often good, so the network relies mostly on it.
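The weighted combination of a solved pose and a learned pose described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the released FAR code: the function names (`blend_poses`, `rotation_to_6d`, `rotation_from_6d`, `rot_z`) are hypothetical, rotations are blended in a 6D representation (the first two columns of the rotation matrix) and re-orthonormalized with Gram-Schmidt, and the scale-free solver translation borrows its magnitude from the learned translation.

```python
import numpy as np


def rotation_to_6d(R):
    """Return the first two columns of a 3x3 rotation matrix as a 6-vector."""
    return np.concatenate([R[:, 0], R[:, 1]])


def rotation_from_6d(v):
    """Map a 6-vector back to a rotation matrix via Gram-Schmidt."""
    a, b = v[:3], v[3:]
    x = a / np.linalg.norm(a)
    b = b - np.dot(x, b) * x          # remove the component of b along x
    y = b / np.linalg.norm(b)
    z = np.cross(x, y)                # right-handed third axis
    return np.stack([x, y, z], axis=1)


def blend_poses(R_s, t_s_dir, R_t, t_t, w):
    """Blend a solver pose with a learned pose (illustrative assumption).

    R_s, t_s_dir: solver rotation and translation direction (no scale).
    R_t, t_t:     learned rotation and translation (with scale).
    w:            weight in [0, 1] placed on the solver pose.
    """
    # Assumption: the solver translation takes its magnitude from t_t.
    scale = np.linalg.norm(t_t)
    t_s = scale * t_s_dir / np.linalg.norm(t_s_dir)

    # Linear blend in the 6D rotation space, then project back to a rotation.
    r6 = w * rotation_to_6d(R_s) + (1.0 - w) * rotation_to_6d(R_t)
    R = rotation_from_6d(r6)

    # Linear blend of the (now scaled) translations.
    t = w * t_s + (1.0 - w) * t_t
    return R, t


def rot_z(deg):
    """Rotation by `deg` degrees about the z axis (helper for the example)."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Example: equal weight on a 30 degree solver rotation and a 40 degree
# learned rotation about the same axis.
R_s, R_t = rot_z(30), rot_z(40)
R, t = blend_poses(R_s, np.array([1.0, 0.0, 0.0]),
                   R_t, np.array([2.0, 0.0, 0.0]), w=0.5)
```

With `w = 1` the sketch returns the solver rotation with the learned translation magnitude, and with `w = 0` it returns the learned pose unchanged. The sketch does not handle the degenerate case where the blended 6D vector has a zero-length or collinear pair of columns.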

BibTeX

@inproceedings{Rockwell2024,
    author = {Rockwell, Chris and Kulkarni, Nilesh and Jin, Linyi and Park, Jeong Joon and Johnson, Justin and Fouhey, David F.},
    title = {FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation},
    booktitle = {CVPR},
    year = 2024
}

CVPR 2024 (Highlight)
