Challenges with Existing Methods

In almost all multi-view Structure-from-Motion pipelines, the coordinate frame is anchored to the first image in the sequence. This assumption breaks when the first view is uninformative, for example when it is just a white wall with no texture. The consequences are significant: the final reconstruction becomes unstable, and the system can fail completely if the input order is unfavorable.
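
To make the anchoring convention concrete, here is a minimal NumPy sketch. It assumes poses are given as 4x4 camera-to-world matrices; the helper names `make_pose`, `anchor_to_view`, and `relative`, as well as the three example poses, are illustrative and not taken from any specific pipeline. Anchoring to a view means left-multiplying every pose by the inverse of that view's pose, so the anchor view becomes the identity.

```python
import numpy as np

def make_pose(yaw_deg, translation):
    """Build a 4x4 camera-to-world pose from a yaw angle (about y) and a translation."""
    a = np.deg2rad(yaw_deg)
    pose = np.eye(4)
    pose[:3, :3] = [[np.cos(a), 0.0, np.sin(a)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(a), 0.0, np.cos(a)]]
    pose[:3, 3] = translation
    return pose

def anchor_to_view(poses_c2w, ref=0):
    """Re-express all poses in the coordinate frame of view `ref`."""
    poses = np.asarray(poses_c2w, dtype=float)
    # (4, 4) @ (N, 4, 4) broadcasts over the leading axis.
    return np.linalg.inv(poses[ref]) @ poses

def relative(poses, i, j):
    """Pose of view j expressed in the frame of view i."""
    return np.linalg.inv(poses[i]) @ poses[j]

# Three made-up poses standing in for the three views (assumed values).
poses = np.stack([
    make_pose(0.0, [0.0, 0.0, 0.0]),
    make_pose(15.0, [0.5, 0.0, 0.1]),
    make_pose(40.0, [1.2, 0.0, 0.3]),
])

anchored_first = anchor_to_view(poses, ref=0)  # usual convention: first view
anchored_last = anchor_to_view(poses, ref=2)   # anchor on the third view instead
```

With exact poses, the anchor is only a choice of coordinates: `relative(anchored_first, 0, 1)` and `relative(anchored_last, 0, 1)` agree. The failure discussed in this section arises when a method's estimates themselves depend on what the first view contains.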

Here we evaluate a state-of-the-art reconstruction model under two conditions. The input consists of three images of the same scene, taken from slightly different viewpoints. The first image captures a textured wall with a bed and some furniture, so it contains good semantic and geometric cues. The second view moves slightly to the side, revealing a wardrobe with detailed textures. The third view is almost entirely a white wall, which is largely textureless and uninformative.

In our experiments, we compare two settings:

  1. The white wall is not the first image (left). The model reconstructs the scene correctly.
  2. The white wall is the first image (right). The model fails to register the other views, and multiple copies of the wall are visible in the image.

The result shows that changing the input order by putting the white wall first significantly degrades the reconstruction. Our goal is therefore to remove the dependency on image input order and make SfM more robust and generalizable.
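
The comparison above can be generalized into a small order-sensitivity check: run the same image set once per choice of first view and compare the outcomes. The sketch below is a hypothetical harness, not the evaluation code used here. `reconstruct` is an assumed callable that takes an ordered list of image names and returns a scalar quality score, and the view labels are placeholders for the three images described earlier.

```python
from typing import Callable, Dict, List, Sequence

def first_view_variants(names: Sequence[str]) -> List[List[str]]:
    """Return one ordering per image, with that image moved to the front.

    The relative order of the remaining images is preserved.
    """
    variants = []
    for i, first in enumerate(names):
        rest = [n for j, n in enumerate(names) if j != i]
        variants.append([first, *rest])
    return variants

def evaluate_orderings(
    names: Sequence[str],
    reconstruct: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Map each choice of first view to the score `reconstruct` returns for it."""
    return {order[0]: reconstruct(order) for order in first_view_variants(names)}

# Placeholder labels for the three views (assumed names).
views = ["bed_wall", "wardrobe", "white_wall"]
variants = first_view_variants(views)
```

A method that is independent of input order should produce roughly the same score for every key in the returned dictionary; a large gap for one key points to a view that is a poor anchor.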