Methods

How Can Flow Supervise Geometry?

Modern visual-geometry networks first encode each image into patch tokens and a camera token (a) using a multi-view transformer backbone. Based on these latent features, there are several ways to predict dense correspondences between frames. Traditional correspondence heads (b) infer flow directly from patch features, relying purely on visual appearance and ignoring the underlying scene geometry. Alternatively, one may compute flow by explicitly projecting predicted 3D points into another view using decoded camera poses (c); however, this approach assumes static scenes and is highly sensitive to geometric prediction errors. In contrast, our factored flow mechanism (d) combines the geometry latents from the source view with the camera latents from the target view and decodes correspondences directly in latent space. This design yields geometry-aware flow, improves robustness, and naturally extends to dynamic scenes.
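The pose-projection baseline (c) can be made concrete with a small sketch: flow is obtained by lifting a pixel to a predicted 3D point, transforming it with the decoded relative pose, and reprojecting into the target view. All values below (intrinsics, pose, point) are toy numbers chosen for illustration, not from the method.

```python
import numpy as np

# Toy pinhole intrinsics and decoded relative pose (all values hypothetical).
K = np.array([[100.0, 0.0, 8.0],
              [0.0, 100.0, 8.0],
              [0.0,   0.0, 1.0]])
R = np.eye(3)                      # relative rotation: identity for simplicity
t = np.array([0.1, 0.0, 0.0])      # relative translation

# One predicted 3D point, expressed in the source camera frame.
X_src = np.array([0.0, 0.0, 2.0])

# Project into the source view to get the source pixel.
u = K @ X_src
u_src = u[:2] / u[2]

# Transform to the target camera frame and reproject.
X_tgt = R @ X_src + t
v = K @ X_tgt
u_tgt = v[:2] / v[2]

# Projection-based flow: valid only if the scene point is static,
# and any error in X_src or (R, t) propagates directly into the flow.
flow = u_tgt - u_src
print(flow)  # [5. 0.]
```

This illustrates why (c) is brittle: the flow inherits every error in the predicted 3D points and poses, and a moving point violates the rigid-reprojection assumption entirely, which is what motivates decoding flow in latent space instead.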

Here we propose Flow3r, which predicts visual geometry using factored flow supervision, enabling scalable geometry learning from unlabeled videos. Each input image is encoded and processed by the multi-view transformer to produce camera tokens and patch tokens. For data with dense geometry and pose labels, we directly supervise the patch tokens and camera tokens with the corresponding labels. For unlabeled datasets without geometry or pose labels, we instead predict flow between two frames in a factorized manner, supervised by an off-the-shelf 2D flow prediction model. To obtain the factored flow, we fuse the patch features of the source frame with the camera features of the target frame, and decode the fused representation through the DPT head to produce dense flow predictions.
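The factored decoding step can be sketched as follows. This is a minimal toy version, assuming per-patch feature concatenation and a linear map standing in for the DPT head; the grid size, feature dimension, and fusion-by-concatenation are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16x16 patch grid with 256-dim latent features.
H, W, D = 16, 16, 256

patch_src = rng.standard_normal((H * W, D))  # geometry latents of the source view
cam_tgt = rng.standard_normal(D)             # camera latent of the target view

# Factored fusion: attach the target-view camera token to every source patch
# token, so the decoder sees "source geometry + target viewpoint" jointly.
fused = np.concatenate(
    [patch_src, np.broadcast_to(cam_tgt, (H * W, D))], axis=1
)

# A toy linear head stands in for the DPT decoder; it maps each fused
# token to a 2-channel flow vector, giving one flow estimate per patch.
W_dec = rng.standard_normal((2 * D, 2)) * 0.01
flow = (fused @ W_dec).reshape(H, W, 2)

print(flow.shape)  # (16, 16, 2)
```

During training on unlabeled video, this dense flow map would be compared against the output of the off-the-shelf 2D flow teacher (e.g. with a per-pixel regression loss), so the geometry and camera latents receive gradient without any 3D labels.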