Related Work

Dense-correspondence Learning

Dense correspondence learning focuses on estimating pixel-level matches across views, which serves as the basis for recovering camera motion and 3D structure. Early work primarily focused on local optical flow, providing per-pixel motion between adjacent frames but failing under large viewpoint changes. Building upon these ideas, recent methods learn geometry-aware dense correspondences that remain stable across wide baselines. For example, DKM refines coarse predictions through hierarchical warping to capture fine-grained geometric alignment, while RoMa leverages pretrained visual features to achieve semantically consistent matching across diverse scenes. Transformer-based architectures such as UFM further unify dense matching and optical-flow reasoning, yielding high-quality correspondences that generalize across domains. Beyond pairwise alignment, dense correspondence learning has expanded to long-range video tracking, with recent models such as CoTracker jointly reasoning over multiple frames to achieve state-of-the-art occlusion-robust tracking. Our approach builds on dense correspondence learning but goes beyond matching: we introduce a factored flow formulation that explicitly links source geometry and target camera pose, as sketched below. This factorization allows the model to infer geometry-aware flow that is consistent with 3D structure, enabling accurate reconstruction and motion estimation in challenging in-the-wild dynamic scenes.
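
To make this concrete, one common way to write such a factorization (a minimal sketch for illustration, not the exact parameterization used in our method) expresses the flow induced by source geometry and target camera pose as

$$\mathbf{f}(\mathbf{p}) \;=\; \pi\!\left(K\left(R\,D(\mathbf{p})\,K^{-1}\tilde{\mathbf{p}} + \mathbf{t}\right)\right) - \mathbf{p},$$

where $D(\mathbf{p})$ is the per-pixel source depth, $K$ the camera intrinsics, $(R, \mathbf{t})$ the relative pose of the target camera, $\tilde{\mathbf{p}}$ the homogeneous pixel coordinate, and $\pi$ the perspective projection. The geometry term $D$ and the pose term $(R, \mathbf{t})$ enter the flow separately, which is what allows flow supervision to constrain both.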

Correspondence-driven Reconstruction

Building on (dense) correspondence estimation, correspondence-driven reconstruction methods recover 3D structure and camera motion directly from learned matches. Classical Structure-from-Motion pipelines estimate camera poses and scene structure from image collections by detecting local features, computing pairwise correspondences, and jointly optimizing camera parameters and 3D points through global bundle adjustment. Visual SLAM systems extend correspondence-based reconstruction to video streams and dynamic scenes, tracking features across frames to jointly estimate camera trajectories and scene geometry. Recent approaches, including Robust-CVD, CasualSAM, and MegaSAM, further incorporate monocular depth priors or single-view geometric supervision to improve robustness under motion and occlusion. However, these methods remain optimization-based, requiring per-video refinement and lacking the efficiency of a feed-forward pass.
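
For reference, the bundle adjustment step underlying these pipelines minimizes the total reprojection error over all visible observations (a standard formulation, stated here only for context):

$$\min_{\{P_i\},\,\{X_j\}} \;\sum_{(i,j)\in\mathcal{V}} \left\| \pi\!\left(P_i, X_j\right) - \mathbf{x}_{ij} \right\|^2,$$

where $P_i$ denotes the parameters of camera $i$, $X_j$ a 3D point, $\mathbf{x}_{ij}$ the observed 2D match of point $j$ in image $i$, $\pi$ the projection function, and $\mathcal{V}$ the set of visible (camera, point) pairs. This joint nonlinear optimization is precisely the per-scene refinement that feed-forward approaches aim to avoid.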

Feed-forward Visual Geometry Learning

Recent efforts aim to replace traditional optimization pipelines with feed-forward networks that directly predict visual geometry from images. DUSt3R first demonstrated that dense pointmaps can be estimated from image pairs within a shared coordinate system, enabling efficient two-view reconstruction. MASt3R further improves this paradigm by introducing a learned matching head for better correspondence reasoning, while DiffusionSfM and VGGT generalize to multi-view settings, jointly estimating camera parameters and scene structure. Subsequent works such as MonST3R, CUT3R, and StreamVGGT extend this formulation to dynamic scenes, learning temporally consistent geometry across video frames. However, these models rely on labeled 3D or camera annotations, which are difficult to collect at scale. In contrast, our method enables scalable feed-forward learning of dynamic visual geometry through factored flow prediction, allowing training on unlabeled real-world videos.
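
To make the dependence on 3D ground truth explicit, the pointmap supervision in this family of models typically takes a confidence-weighted regression form (sketched here after the DUSt3R loss; notation is illustrative):

$$\mathcal{L}_{\text{conf}} \;=\; \sum_{\mathbf{p}} C(\mathbf{p})\,\left\| \tfrac{1}{z}\,X(\mathbf{p}) - \tfrac{1}{\bar z}\,\bar{X}(\mathbf{p}) \right\| \;-\; \alpha \log C(\mathbf{p}),$$

where $X$ and $\bar{X}$ are the predicted and ground-truth pointmaps, $z$ and $\bar z$ are normalizing scale factors, $C$ is a predicted per-pixel confidence, and $\alpha$ weights the confidence regularizer. Every term requires metric 3D supervision $\bar{X}$, which is exactly the annotation our factored flow objective avoids.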