Introduction

What is Structure from Motion (SfM)?

SfM aims to reconstruct both the 3D structure of a scene and the camera poses from a set of input images. It has broad applications, from robotics to augmented reality.

Related Work

In recent years, deep learning-based SfM methods have emerged. For example, DiffusionSfM (CVPR 2025) uses a diffusion model to jointly predict camera poses and 3D points from an unordered set of images.

DiffusionSfM, a diffusion-based framework for joint pose and structure estimation from unordered images


Another method, VGGT, follows a regression-based approach and uses transformers to directly output the 3D scene from multi-view inputs. While these methods improve robustness and scalability, they still implicitly define the coordinate frame using the first image.

VGGT, a regression-based model that directly predicts 3D scene structure using transformers
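
To make the phrase "define the coordinate frame using the first image" concrete, here is a minimal NumPy sketch of what anchoring a reconstruction to the first camera means geometrically. It is not code from DiffusionSfM or VGGT. The helper names (`make_pose`, `transform_points`, `to_first_camera_frame`) and the convention of 4x4 world-to-camera matrices are assumptions chosen for illustration only.

```python
import numpy as np


def make_pose(yaw, translation):
    """Hypothetical helper: 4x4 world-to-camera transform from a yaw angle (radians) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T


def transform_points(T, points):
    """Apply a 4x4 rigid transform to an (N, 3) array of points."""
    points_h = np.hstack([points, np.ones((len(points), 1))])
    return (T @ points_h.T).T[:, :3]


def to_first_camera_frame(world_to_cam, points_world):
    """Re-express poses and points so that camera 0 becomes the origin.

    world_to_cam: (M, 4, 4) world-to-camera transforms (assumed convention).
    points_world: (N, 3) points in the original world frame.
    """
    T0 = world_to_cam[0]
    T0_inv = np.linalg.inv(T0)
    # The new "world" is camera 0's frame, so each pose absorbs T0^{-1}.
    new_poses = np.stack([T @ T0_inv for T in world_to_cam])
    # Points move into camera 0's frame.
    new_points = transform_points(T0, points_world)
    return new_poses, new_points


poses = np.stack([
    make_pose(0.3, [1.0, 0.0, 2.0]),
    make_pose(-0.5, [0.0, 1.0, 3.0]),
])
points = np.array([[0.5, -0.2, 4.0], [1.0, 1.0, 5.0]])

new_poses, new_points = to_first_camera_frame(poses, points)

# Camera 0 is now the identity, and every camera still sees the same geometry.
print(np.allclose(new_poses[0], np.eye(4)))
print(np.allclose(transform_points(poses[1], points),
                  transform_points(new_poses[1], new_points)))
```

The reconstruction itself is unchanged by this re-expression: only the choice of reference frame differs. That is why tying the frame to the first image is a convention rather than a property of the scene, and why the output then depends on which image happens to come first.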