Results

Does Factored Flow Help?

We first compare our factored prediction paradigm against alternative designs and no-flow baselines on static and dynamic scenes. We include two models trained with full 3D supervision but with different numbers of training sequences (denoted 3d-sup and 3d-sup++). Building upon the no-flow baseline 3d-sup, we then introduce three additional variants that incorporate flow supervision using different formulations: (1) flow-projective, which computes flow explicitly from predicted camera poses and pointmaps via projective geometry; (2) flow-tracking, which adopts a VGGT-style tracking head based on pairwise patch features; and (3) flow-factored, which applies our proposed factored flow prediction formulation.
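
For intuition on the flow-projective variant, the following is a minimal sketch (not the paper's implementation; the function name, tensor layouts, and camera conventions are assumptions) of how flow can be derived from a predicted pointmap and camera via projective geometry: each pixel's world-frame 3D point is reprojected into the other view, and the displacement from the original pixel grid gives the flow.

```python
import torch

def projective_flow(pointmap_i, extrinsics_j, intrinsics_j):
    """Sketch: derive optical flow i -> j from a predicted pointmap and camera.

    pointmap_i:   (H, W, 3) world-frame 3D points predicted for frame i
    extrinsics_j: (4, 4)    world-to-camera transform predicted for frame j
    intrinsics_j: (3, 3)    pinhole intrinsics of frame j
    Returns:      (H, W, 2) displacement from each pixel of frame i to its
                  reprojected location in frame j.
    """
    H, W, _ = pointmap_i.shape
    # Homogenize and transform world points into frame j's camera coordinates.
    pts = torch.cat([pointmap_i, torch.ones_like(pointmap_i[..., :1])], dim=-1)  # (H, W, 4)
    cam = torch.einsum('ij,hwj->hwi', extrinsics_j, pts)[..., :3]                # (H, W, 3)
    # Perspective projection onto frame j's image plane.
    uv = torch.einsum('ij,hwj->hwi', intrinsics_j, cam)
    uv = uv[..., :2] / uv[..., 2:3].clamp(min=1e-6)                              # (H, W, 2)
    # Reference pixel grid of frame i, in (x, y) order.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).to(uv.dtype)                            # (H, W, 2)
    return uv - grid
```

Because any error in the predicted poses or pointmaps propagates directly into the reprojected pixels, supervising flow through this chain can be numerically sensitive, which is consistent with the degradation we observe for flow-projective below.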

Static Scene Results

On ScanNet++, our factored flow prediction model (flow-factored) significantly outperforms the no-flow baseline (3d-sup), while also outperforming the other alternatives that leverage flow supervision. It even performs on par with the fully-3D-supervised baseline (3d-sup++).

Table 1. Does factored flow prediction help visual geometry learning on static scenes? On ScanNet++, our factored flow prediction model (flow-factored) significantly outperforms the no-flow baseline (3d-sup), while also outperforming the other alternatives that leverage flow supervision. It even performs on par with the fully-3D-supervised baseline (3d-sup++).

The results in Tab. 1 show that flow-factored outperforms both flow-supervised alternatives (flow-projective and flow-tracking) and the no-flow baseline, achieving higher camera pose accuracy and better geometric quality. Notably, flow-tracking provides almost no improvement in pose accuracy or geometric quality, suggesting that supervising flow prediction from pairwise patch features does not meaningfully benefit visual geometry learning. Moreover, flow-projective even degrades performance on both pose metrics and geometry quality, indicating that supervising flow computed from explicit camera and geometry predictions can be unstable and thus harm learning. Compared to 3d-sup++, the no-flow baseline trained with full 3D supervision on a larger set of sequences, flow-factored achieves comparable pose accuracy and geometry quality while even slightly improving camera center accuracy and reconstruction MSE. These results demonstrate that our proposed factored flow prediction is an effective approach to scaling visual geometry learning on static scenes, especially when dense 3D labels are scarce.

Dynamic Scene Results

We train seven model variants on OmniWorld and SpatialVID, where OmniWorld provides 3D supervision and SpatialVID offers flow supervision. Consistent with our findings on static scenes, flow-factored with factored flow prediction considerably outperforms the no-flow baseline (3d-sup) and the other flow-supervised alternatives. Moreover, factored flow prediction brings consistent gains as more data is used.

Table 2. Does factored flow prediction help dynamic visual geometry learning? We train seven model variants on OmniWorld and SpatialVID, where OmniWorld provides 3D supervision and SpatialVID offers flow supervision. Consistent with our findings on static scenes, flow-factored with factored flow prediction considerably outperforms the no-flow baseline (3d-sup) and the other flow-supervised alternatives. Moreover, factored flow prediction brings consistent gains as more data is used.
Figure 1. Factored flow prediction aids visual geometry learning. Compared with the baseline (3d-sup) and alternative formulations that use flow supervision (flow-projective, flow-tracking), Flow3r (flow-factored) yields more accurate dynamic-scene geometry and further improves with additional training data. This shows the effectiveness of factored flow prediction for geometry learning.

The results on dynamic scenes are presented in Tab. 2, and they exhibit a performance pattern consistent with our observations on static scenes: our factored flow prediction significantly improves over the no-flow baseline (3d-sup), whereas flow supervision via VGGT's tracking head provides negligible gains, and supervising flow computed from explicit camera and geometry predictions continues to degrade performance. The results also reveal that scaling up the number of training sequences (e.g., to 10× or 20× the amount used in the 3d-sup no-flow baseline) yields consistent improvements. Notably, the flow-factored++ variant, trained with 20K unlabeled dynamic video sequences in addition to 1K 3D-labeled sequences, surpasses 3d-sup++, which uses 3K labeled sequences for full 3D supervision. These results demonstrate that supervising flow through our factored prediction formulation can effectively scale visual geometry learning by leveraging large quantities of unlabeled dynamic video data.

In Fig. 1, we qualitatively compare our method with baselines under full 3D supervision and those with different flow formulations. Notably, flow-factored significantly improves reconstruction quality over the no-flow baseline (3d-sup) while outperforming other flow-supervised model variants. Also, leveraging more data brings non-trivial gains in reconstruction quality, and the resulting model performs comparably with or even surpasses the baseline with the largest number of training sequences under full 3D supervision (3d-sup++).

Scalability: Large-Scale Training with Unlabeled Videos

Here we scale the training of an off-the-shelf large visual geometry network (VGGT) by leveraging our factored flow prediction strategy with unlabeled dynamic data. We evaluate pose accuracy and reconstruction quality on four dynamic datasets: Kinetics700, Epic-Kitchens, Sintel, and Bonn.

We compare Flow3r (and Flow3r*) with CUT3R, VGGT, and π3. Since the official π3 checkpoint was trained with unreleased dynamic-scene data, we re-implement π3 and train a model (denoted as π3*) using the same training data as our method to ensure a fair comparison. CUT3R is trained on a considerably larger pool of data (30+ datasets spanning diverse domains), whereas VGGT and our π3* baseline are trained on the same data as Flow3r. On the four dynamic datasets above, we use MegaSAM to compute pseudo ground truth from dense videos for Kinetics700 and Epic-Kitchens. Following prior work, we report Relative Pose Error, including RPE (trans) and RPE (rot). We assess 3D geometry using mean squared error (MSE), which captures overall geometric fidelity, and F-score, which measures the accuracy–completeness trade-off.
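
For reference, the following is a minimal sketch of how such metrics are commonly computed (not necessarily the exact evaluation protocol used here; the distance threshold, pose conventions, and function names are assumptions): the F-score is the harmonic mean of point-cloud precision and recall at a distance threshold, and RPE compares consecutive relative poses between the estimated and reference trajectories.

```python
import numpy as np
from scipy.spatial import cKDTree

def f_score(pred_pts, gt_pts, tau=0.05):
    """F-score at threshold tau (scene units) between two (N, 3) point clouds."""
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]   # nearest-GT distance per predicted point
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]   # nearest-prediction distance per GT point
    precision = (d_pred_to_gt < tau).mean()             # accuracy of the prediction
    recall = (d_gt_to_pred < tau).mean()                # completeness of the prediction
    return 2.0 * precision * recall / max(precision + recall, 1e-8)

def relative_pose_errors(T_est, T_gt):
    """Mean RPE (trans, rot in degrees) over consecutive 4x4 camera-to-world poses."""
    rpe_t, rpe_r = [], []
    for i in range(len(T_est) - 1):
        rel_est = np.linalg.inv(T_est[i]) @ T_est[i + 1]
        rel_gt = np.linalg.inv(T_gt[i]) @ T_gt[i + 1]
        err = np.linalg.inv(rel_gt) @ rel_est            # residual relative motion
        rpe_t.append(np.linalg.norm(err[:3, 3]))         # translation drift
        cos = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        rpe_r.append(np.degrees(np.arccos(cos)))         # rotation drift
    return float(np.mean(rpe_t)), float(np.mean(rpe_r))
```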

Table 3. Comparison on dynamic datasets. Best, second-best, and third-best results are highlighted in light red, orange, and yellow, respectively. Flow3r outperforms other methods in both camera pose estimation and scene reconstruction, demonstrating its effectiveness.
Figure 2. Qualitative results on in-the-wild videos. While other methods fail to reconstruct the scene accurately and often align it to a moving object (top row), Flow3r robustly recovers dynamic scenes from in-the-wild videos, even under complex motion.
Figure 3. Comparison of utilizing a large-scale unlabeled dataset. Compared with Flow3r*, Flow3r more accurately predicts dense flow and geometry on dynamic datasets, demonstrating the effectiveness of using large-scale unlabeled data via factored flow prediction.

We report our evaluations in Tab. 3. For both pose estimation and scene reconstruction, Flow3r consistently outperforms baselines that use comparable training data, e.g., VGGT and π3*. Although the official π3 model is trained on more data, Flow3r performs comparably on most metrics and even outperforms π3 on a few, e.g., pose accuracy on Epic-Kitchens and reconstruction quality on Sintel, demonstrating the benefit of leveraging unlabeled video data via our factored flow prediction. We include qualitative results on in-the-wild videos in Fig. 2, where Flow3r infers cleaner and more accurate scene structure than the baselines. We also observe that Flow3r consistently outperforms Flow3r* by a large margin, demonstrating the effectiveness of scaling with large amounts of unlabeled data. A visual comparison between the two models in Fig. 3 further shows significant improvements in both the predicted flow fields and the scene geometry.