Results

Does Factored Flow Help?

We first compare our factored prediction paradigm against alternative designs and no-flow baselines on static and dynamic scenes. We include two models trained with full 3D supervision but with different numbers of training sequences (denoted 3d-sup and 3d-sup++). Building upon the no-flow baseline 3d-sup, we then introduce three additional variants that incorporate flow supervision using different formulations: (1) flow-projective, which computes flow explicitly from predicted camera poses and pointmaps via projective geometry; (2) flow-tracking, which adopts a VGGT-style tracking head based on pairwise patch features; and (3) flow-factored, which applies our proposed factored flow prediction formulation.
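
For intuition on the flow-projective variant, the following is a minimal sketch (not the paper's implementation; the function name, tensor layouts, and camera conventions are assumptions) of how flow can be derived from a predicted pointmap and camera via projective geometry: each pixel's world-frame 3D point is reprojected into the other view, and the displacement from the original pixel grid gives the flow.

```python
import torch

def projective_flow(pointmap_i, extrinsics_j, intrinsics_j):
    """Sketch: derive optical flow i -> j from a predicted pointmap and camera.

    pointmap_i:   (H, W, 3) world-frame 3D points predicted for frame i
    extrinsics_j: (4, 4)    world-to-camera transform predicted for frame j
    intrinsics_j: (3, 3)    pinhole intrinsics of frame j
    Returns:      (H, W, 2) displacement from each pixel of frame i to its
                  reprojected location in frame j.
    """
    H, W, _ = pointmap_i.shape
    # Homogenize and transform world points into frame j's camera coordinates.
    pts = torch.cat([pointmap_i, torch.ones_like(pointmap_i[..., :1])], dim=-1)  # (H, W, 4)
    cam = torch.einsum('ij,hwj->hwi', extrinsics_j, pts)[..., :3]                # (H, W, 3)
    # Perspective projection onto frame j's image plane.
    uv = torch.einsum('ij,hwj->hwi', intrinsics_j, cam)
    uv = uv[..., :2] / uv[..., 2:3].clamp(min=1e-6)                              # (H, W, 2)
    # Reference pixel grid of frame i, in (x, y) order.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).to(uv.dtype)                            # (H, W, 2)
    return uv - grid
```

Because any error in the predicted poses or pointmaps propagates directly into the reprojected pixels, supervising flow through this chain can be numerically sensitive, which is consistent with the degradation we observe for flow-projective below.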

Static Scene Results

On ScanNet++, our factored flow prediction model (flow-factored) significantly outperforms the no-flow baseline (3d-sup), while also outperforming the other alternatives that leverage flow supervision. It even performs on par with the fully-3D-supervised baseline (3d-sup++).

Table 1. Does factored flow prediction help visual geometry learning on static scenes? On ScanNet++, our factored flow prediction model (flow-factored) significantly outperforms the no-flow baseline (3d-sup), while also outperforming the other alternatives that leverage flow supervision. It even performs on par with the fully-3D-supervised baseline (3d-sup++).

The results in Tab. 1 show that flow-factored outperforms both flow-supervised alternatives (flow-projective and flow-tracking) and the no-flow baseline, achieving higher camera pose accuracy and better geometric quality. Notably, flow-tracking provides almost no improvement in pose accuracy or geometric quality, suggesting that supervising flow prediction from pairwise patch features does not meaningfully benefit visual geometry learning. Moreover, flow-projective even degrades performance on both pose metrics and geometry quality, indicating that supervising flow computed from explicit camera and geometry predictions can be unstable and thus harm learning. Compared to 3d-sup++, the no-flow baseline trained with full 3D supervision on a larger set of sequences, flow-factored achieves comparable pose accuracy and geometry quality while even slightly improving camera center accuracy and reconstruction MSE. These results demonstrate that our proposed factored flow prediction is an effective approach to scaling visual geometry learning on static scenes, especially when dense 3D labels are scarce.

Dynamic Scene Results

We train seven model variants on OmniWorld and SpatialVID, where OmniWorld provides 3D supervision and SpatialVID offers flow supervision. Consistent with our findings on static scenes, flow-factored with factored flow prediction considerably outperforms the no-flow baseline (3d-sup) and the other flow-supervised alternatives. Moreover, factored flow prediction brings consistent gains as more data is used.

Table 2. Does factored flow prediction help dynamic visual geometry learning? We train seven model variants on OmniWorld and SpatialVID, where OmniWorld provides 3D supervision and SpatialVID offers flow supervision. Consistent with our findings on static scenes, flow-factored with factored flow prediction considerably outperforms the no-flow baseline (3d-sup) and the other flow-supervised alternatives. Moreover, factored flow prediction brings consistent gains as more data is used.
Figure 1. Factored flow prediction aids visual geometry learning. Compared with the baseline (3d-sup) and alternative formulations that use flow supervision (flow-projective, flow-tracking), Flow3r (flow-factored) yields more accurate dynamic-scene geometry and further improves with additional training data. This shows the effectiveness of factored flow prediction for geometry learning.

The results on dynamic scenes are presented in Tab. 2, and they exhibit a performance pattern consistent with our observations on static scenes: our factored flow prediction significantly improves over the no-flow baseline (3d-sup), whereas flow supervision via VGGT's tracking head provides negligible gains, and supervising flow computed from explicit camera and geometry predictions continues to degrade performance. The results also reveal that scaling up the number of training sequences (e.g., to 10× or 20× the amount used in the 3d-sup no-flow baseline) yields consistent improvements. Notably, the flow-factored++ variant, trained with 20K unlabeled dynamic video sequences in addition to 1K 3D-labeled sequences, surpasses 3d-sup++, which uses 3K labeled sequences for full 3D supervision. These results demonstrate that supervising flow through our factored prediction formulation can effectively scale visual geometry learning by leveraging large quantities of unlabeled dynamic video data.

In Fig. 1, we qualitatively compare our method with baselines under full 3D supervision and those with different flow formulations. Notably, flow-factored significantly improves reconstruction quality over the no-flow baseline (3d-sup) while outperforming other flow-supervised model variants. Also, leveraging more data brings non-trivial gains in reconstruction quality, and the resulting model performs comparably with or even surpasses the baseline with the largest number of training sequences under full 3D supervision (3d-sup++).

Scalability: Large-Scale Training with Unlabeled Videos

Here we scale the training of an off-the-shelf large visual geometry network (VGGT) by leveraging our factored flow prediction strategy with unlabeled dynamic data. We evaluate pose accuracy and reconstruction quality on four dynamic datasets: Kinetics700, Epic-Kitchens, Sintel, and Bonn.

We compare Flow3r (and Flow3r*) with CUT3R, VGGT, and π3. Since the official π3 checkpoint was trained with unreleased dynamic-scene data, we re-implement π3 and train a model (denoted as π3*) using the same training data as our method to ensure a fair comparison. CUT3R is trained on a considerably larger pool of data (30+ datasets spanning diverse domains), whereas VGGT and our π3* baseline are trained on the same data as Flow3r. On the four dynamic datasets above, we use MegaSAM to compute pseudo ground truth from dense videos for Kinetics700 and Epic-Kitchens. Following prior work, we report Relative Pose Error, including RPE (trans) and RPE (rot). We assess 3D geometry using mean squared error (MSE), which captures overall geometric fidelity, and F-score, which measures the accuracy–completeness trade-off.
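
For reference, the following is a minimal sketch of how such metrics are commonly computed (not necessarily the exact evaluation protocol used here; the distance threshold, pose conventions, and function names are assumptions): the F-score is the harmonic mean of point-cloud precision and recall at a distance threshold, and RPE compares consecutive relative poses between the estimated and reference trajectories.

```python
import numpy as np
from scipy.spatial import cKDTree

def f_score(pred_pts, gt_pts, tau=0.05):
    """F-score at threshold tau (scene units) between two (N, 3) point clouds."""
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]   # nearest-GT distance per predicted point
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]   # nearest-prediction distance per GT point
    precision = (d_pred_to_gt < tau).mean()             # accuracy of the prediction
    recall = (d_gt_to_pred < tau).mean()                # completeness of the prediction
    return 2.0 * precision * recall / max(precision + recall, 1e-8)

def relative_pose_errors(T_est, T_gt):
    """Mean RPE (trans, rot in degrees) over consecutive 4x4 camera-to-world poses."""
    rpe_t, rpe_r = [], []
    for i in range(len(T_est) - 1):
        rel_est = np.linalg.inv(T_est[i]) @ T_est[i + 1]
        rel_gt = np.linalg.inv(T_gt[i]) @ T_gt[i + 1]
        err = np.linalg.inv(rel_gt) @ rel_est            # residual relative motion
        rpe_t.append(np.linalg.norm(err[:3, 3]))         # translation drift
        cos = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        rpe_r.append(np.degrees(np.arccos(cos)))         # rotation drift
    return float(np.mean(rpe_t)), float(np.mean(rpe_r))
```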

Table 3. Comparison on dynamic datasets. Best, second-best, and third-best results are highlighted in light red, orange, and yellow, respectively. Flow3r outperforms other methods in both camera pose estimation and scene reconstruction, demonstrating its effectiveness.
Figure 2. Qualitative results on in-the-wild videos. While other methods fail to reconstruct the scene accurately and often align it to a moving object (top row), Flow3r robustly recovers dynamic scenes from in-the-wild videos, even under complex motion.
Figure 3. Comparison of utilizing a large-scale unlabeled dataset. Compared with Flow3r*, Flow3r more accurately predicts dense flow and geometry on dynamic datasets, demonstrating the effectiveness of using large-scale unlabeled data via factored flow prediction.

We report our evaluations in Tab. 3. For both pose estimation and scene reconstruction, Flow3r consistently outperforms baselines that use comparable training data, e.g., VGGT and π3*. Although the official π3 model is trained on more data, Flow3r performs comparably on most metrics and even outperforms π3 on a few, e.g., pose accuracy on Epic-Kitchens and reconstruction quality on Sintel, demonstrating the benefit of leveraging unlabeled video data via our factored flow prediction. We include qualitative results on in-the-wild videos in Fig. 2, where Flow3r infers cleaner and more accurate scene structure than the baselines. We also observe that Flow3r consistently outperforms Flow3r* by a large margin, demonstrating the effectiveness of scaling with large amounts of unlabeled data. A visual comparison between the two models in Fig. 3 further shows significant improvements in both the predicted flow fields and the scene geometry.