Methods

How Can Flow Supervise Geometry?

Modern visual-geometry networks first encode each image into patch tokens and a camera token (a) using a multi-view transformer backbone. Based on these latent features, there are several ways to predict dense correspondences between frames. Traditional correspondence heads (b) infer flow directly from patch features, relying purely on visual appearance and ignoring the underlying scene geometry. Alternatively, one may compute flow by explicitly projecting predicted 3D points into another view using decoded camera poses (c); however, this approach assumes static scenes and is highly sensitive to geometric prediction errors. In contrast, our factored flow mechanism (d) combines the geometry latents from the source view with the camera latents from the target view and decodes correspondences directly in latent space. This design yields geometry-aware flow, improves robustness, and naturally extends to dynamic scenes.
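The pose-projection baseline (c) can be made concrete with a small sketch: flow is obtained by lifting a pixel to a predicted 3D point, transforming it with the decoded relative pose, and reprojecting into the target view. All values below (intrinsics, pose, point) are toy numbers chosen for illustration, not from the method.

```python
import numpy as np

# Toy pinhole intrinsics and decoded relative pose (all values hypothetical).
K = np.array([[100.0, 0.0, 8.0],
              [0.0, 100.0, 8.0],
              [0.0,   0.0, 1.0]])
R = np.eye(3)                      # relative rotation: identity for simplicity
t = np.array([0.1, 0.0, 0.0])      # relative translation

# One predicted 3D point, expressed in the source camera frame.
X_src = np.array([0.0, 0.0, 2.0])

# Project into the source view to get the source pixel.
u = K @ X_src
u_src = u[:2] / u[2]

# Transform to the target camera frame and reproject.
X_tgt = R @ X_src + t
v = K @ X_tgt
u_tgt = v[:2] / v[2]

# Projection-based flow: valid only if the scene point is static,
# and any error in X_src or (R, t) propagates directly into the flow.
flow = u_tgt - u_src
print(flow)  # [5. 0.]
```

This illustrates why (c) is brittle: the flow inherits every error in the predicted 3D points and poses, and a moving point violates the rigid-reprojection assumption entirely, which is what motivates decoding flow in latent space instead.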

Here we propose Flow3r, which predicts visual geometry using factored flow supervision, enabling scalable geometry learning from unlabeled videos. Each input image is encoded and processed by the multi-view transformer to produce camera tokens and patch tokens. For data with dense geometry and pose labels, we directly supervise the patch tokens and camera tokens with the corresponding labels. For unlabeled datasets without geometry or pose labels, we instead predict flow between two frames in a factorized manner, supervised by an off-the-shelf 2D flow prediction model. To obtain the factored flow, we fuse the patch features of the source frame with the camera features of the target frame, and decode the fused representation through the DPT head to produce dense flow predictions.
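The factored decoding step can be sketched as follows. This is a minimal toy version, assuming per-patch feature concatenation and a linear map standing in for the DPT head; the grid size, feature dimension, and fusion-by-concatenation are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16x16 patch grid with 256-dim latent features.
H, W, D = 16, 16, 256

patch_src = rng.standard_normal((H * W, D))  # geometry latents of the source view
cam_tgt = rng.standard_normal(D)             # camera latent of the target view

# Factored fusion: attach the target-view camera token to every source patch
# token, so the decoder sees "source geometry + target viewpoint" jointly.
fused = np.concatenate(
    [patch_src, np.broadcast_to(cam_tgt, (H * W, D))], axis=1
)

# A toy linear head stands in for the DPT decoder; it maps each fused
# token to a 2-channel flow vector, giving one flow estimate per patch.
W_dec = rng.standard_normal((2 * D, 2)) * 0.01
flow = (fused @ W_dec).reshape(H, W, 2)

print(flow.shape)  # (16, 16, 2)
```

During training on unlabeled video, this dense flow map would be compared against the output of the off-the-shelf 2D flow teacher (e.g. with a per-pixel regression loss), so the geometry and camera latents receive gradient without any 3D labels.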