Abstract
We present a simple baseline for directly estimating the relative pose (rotation and translation, including scale) between two images. Deep methods have recently shown strong progress on this task but often require complex or multi-stage architectures. We show that a handful of modifications to a Vision Transformer (ViT) bring its computations close to those of the Eight-Point Algorithm. This inductive bias enables a simple method to be competitive in multiple settings, often substantially improving over the state of the art, with particularly strong gains in limited-data regimes.