The task of ‘visual geometry inference’, i.e., recovering the 3D structure of a (static or dynamic) scene from multi-view input images, has undergone a paradigm shift, evolving from classical optimization-based methods to recent data-driven predictors that can directly output the geometry and pose corresponding to the input images. The success of such efficient feedforward systems, however, has crucially relied on multi-view training data with dense geometry and camera pose supervision. Unfortunately, this supervision may not be easily available across all settings of interest, e.g., for dynamic scenes in the wild or domains like egocentric videos, and existing visual geometry prediction methods do not generalize well to such scenarios. More broadly, unlike the self-supervised learning objectives common for training LLMs and vision transformers, the reliance on dense geometry and pose labels prevents truly large-scale visual geometry learning.
In this work, we take a step towards scalable learning of multi-view models and present Flow3r, a framework to guide visual geometry learning from unlabeled videos, i.e., without any explicit geometry or pose supervision. Instead, Flow3r leverages a readily available supervisory signal that is a cornerstone of classical (and recent) optimization-based multi-view methods: (dense) correspondences across images. In particular, we are inspired by the progress in inferring dense correspondences or pixel tracks for generic image pairs and videos, and seek to unlock scalable learning by incorporating such (2D) flow as auxiliary supervision for (3D) visual geometry models. The key technical question we seek to answer is: ‘how can flow be effectively leveraged to supervise visual geometry prediction?’
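To make the notion of auxiliary flow supervision concrete, one illustrative formulation (the notation here, including the weight $\lambda$ and the teacher flow $f^{\text{teacher}}$, is introduced purely for exposition) adds a flow-matching term on unlabeled videos to the usual supervised objectives:

$$
\mathcal{L} \;=\; \underbrace{\mathcal{L}_{\text{geo}} + \mathcal{L}_{\text{pose}}}_{\text{labeled 3D data}} \;+\; \lambda \, \underbrace{\big\lVert \hat{f}_{s \to t} - f^{\text{teacher}}_{s \to t} \big\rVert_1}_{\text{unlabeled videos}},
$$

where $\hat{f}_{s \to t}$ is the model's predicted flow from a source frame $s$ to a target frame $t$, and $f^{\text{teacher}}_{s \to t}$ comes from an off-the-shelf 2D flow or point-tracking model. The question above then amounts to asking how $\hat{f}_{s \to t}$ should be computed from the model's internal representations so that this auxiliary term actually supervises geometry and pose.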
We are not the first to consider flow supervision for guiding visual geometry learning. Indeed, the seminal VGGT work adds a ‘tracking’ module that uses local features from two images to predict flow between them, and uses this as an auxiliary training objective. However, as we show later, this merely encourages the corresponding features to be visually discriminative and does not directly aid the learning of pose or geometry. Our key insight is that, to guide geometry learning, the design of the flow prediction module should be asymmetric. We start from the observation that, for static scenes, the flow between a source and a target image can be induced using only the geometry of the source image (pointmaps in a global coordinate frame) and the camera pose of the target image. Building on this, we propose to incorporate a factored flow prediction module in visual geometry models. Specifically, such models typically compute ‘local’ patchwise features from which geometry is later predicted, as well as a global per-image token from which camera pose is inferred. Our flow prediction module is designed to compute flow between two images using only the global pose token of the target image, together with the patchwise tokens of the source image.
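To spell out this observation, consider a static scene with source pointmap $X_s(p)$ (the global-frame 3D point at source pixel $p$) and a target camera with world-to-camera rotation and translation $(R_t, t_t)$ and intrinsics $K_t$; this notation is introduced for exposition. The flow from source to target is then

$$
f_{s \to t}(p) \;=\; \pi\big( K_t \, ( R_t \, X_s(p) + t_t ) \big) \;-\; p,
$$

where $\pi$ denotes perspective projection (division by depth). Notably, no appearance information from the target image enters this expression; the source geometry and the target pose suffice, and this is precisely the asymmetry our factored module mirrors, while remaining learned and hence applicable beyond strictly static scenes. Below is a minimal sketch of such a factored flow head; the module structure, names, and fusion scheme are illustrative assumptions rather than our exact architecture.

```python
import torch
import torch.nn as nn

class FactoredFlowHead(nn.Module):
    """Illustrative sketch of an asymmetric (factored) flow head.

    Flow from a source to a target image is predicted from the source
    image's patchwise tokens, conditioned only on the target image's
    single global pose token. Since the target contributes no local
    appearance features, the flow loss can only be satisfied if the
    patch tokens encode geometry and the pose token encodes camera pose.
    (Hypothetical module: names, dimensions, and fusion are assumptions.)
    """

    def __init__(self, dim: int = 768):
        super().__init__()
        # Map the target's global pose token to a conditioning vector.
        self.pose_proj = nn.Linear(dim, dim)
        # Fuse each source patch token with the pose conditioning and
        # regress a 2D flow vector per patch.
        self.flow_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, 2),
        )

    def forward(self, src_patch_tokens: torch.Tensor,
                tgt_pose_token: torch.Tensor) -> torch.Tensor:
        # src_patch_tokens: (B, N, D) patchwise features of the source image
        # tgt_pose_token:   (B, D)    global per-image token of the target
        B, N, D = src_patch_tokens.shape
        cond = self.pose_proj(tgt_pose_token).unsqueeze(1).expand(B, N, D)
        fused = torch.cat([src_patch_tokens, cond], dim=-1)  # (B, N, 2D)
        return self.flow_mlp(fused)  # (B, N, 2) flow per source patch

# Example usage with ViT-style tokens (batch of 2, 196 patches, dim 768):
# head = FactoredFlowHead()
# flow = head(torch.randn(2, 196, 768), torch.randn(2, 768))
```

Contrast this with a symmetric design that matches local features of both images: there, the flow objective can be met simply by making features visually discriminative, without ever grounding them in geometry or pose.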
We find that our factored flow prediction provides better supervision for pose and geometry learning than the symmetric design adopted by prior works, while also enabling robust prediction and application to dynamic scenes, unlike projection-based flow inference. Flow3r integrates such factored flow prediction and leverages 800k unlabeled videos as supervision in addition to existing (labeled) 3D datasets. We show that this allows Flow3r to outperform prior visual geometry systems, in particular improving over them on in-the-wild dynamic videos, where labeled data is scarce. More broadly, we believe that by enabling the extraction of supervisory signal from unlabeled videos (albeit by leveraging off-the-shelf 2D flow prediction), Flow3r represents a step towards large-scale visual geometry learning without large-scale supervision.
