Experimental Setup
Datasets. We primarily evaluate our method on NAVI, a real-world multi-view,
object-centric dataset that provides high-quality foreground masks, precise
camera poses, and ground-truth 3D meshes. For each of the 35 objects in
NAVI, we randomly select 5 multi-view sequences for pose estimation and reconstruction.
Baselines. To evaluate camera pose accuracy, we select three sparse-view pose
estimation baselines: RelPose++, Ray Diffusion, and DUSt3R. The first two are trained exclusively on CO3D, while DUSt3R is trained on a mixture of eight datasets, so the three provide initial camera poses of differing precision. Our method initializes from and improves the pose estimates of these
baselines, and we also compare with SPARF, which jointly optimizes sparse-view camera poses and a NeRF. To evaluate novel view synthesis, we mainly compare with unposed sparse-view reconstruction approaches, LEAP and UpFusion. We conduct experiments with varying numbers of input images (N = 6, 8, 10, 16).
Evaluation
Qualitative Comparison of Camera Pose Accuracy
We compare DiffusionSfM with SPARF on camera pose accuracy. Our method remains robust even when the initial camera poses contain large errors.
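Sparse-view pose accuracy is commonly summarized as the fraction of image pairs whose relative rotation error falls below an angular threshold (e.g. 15 degrees). The exact metric and threshold are assumptions here, since the section does not specify them; a minimal sketch:

```python
import numpy as np

def rotation_angle_deg(R_a, R_b):
    # Geodesic distance between two rotation matrices, in degrees.
    R_rel = R_a.T @ R_b
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def pairwise_rotation_accuracy(Rs_pred, Rs_gt, thresh_deg=15.0):
    # Fraction of image pairs whose relative rotation error is below
    # thresh_deg. Relative rotations make the metric invariant to a
    # global rotation of the whole camera set.
    n = len(Rs_pred)
    errs = []
    for i in range(n):
        for j in range(i + 1, n):
            rel_pred = Rs_pred[i].T @ Rs_pred[j]
            rel_gt = Rs_gt[i].T @ Rs_gt[j]
            errs.append(rotation_angle_deg(rel_pred, rel_gt))
    return float((np.array(errs) < thresh_deg).mean())
```

Comparing relative (pairwise) rotations rather than absolute ones sidesteps the gauge ambiguity: predicted poses are only defined up to a global similarity transform.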

Qualitative Comparison of Novel View Synthesis
We compare DiffusionSfM with LEAP on novel view synthesis. Our method preserves high-fidelity details from the input images. Please refer to our paper for additional comparisons with SPARF and UpFusion.
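Novel view synthesis quality is typically reported with PSNR (often alongside SSIM and LPIPS); the choice of metric is an assumption here, as the section does not name one. A minimal PSNR sketch for images with values in [0, 1]:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio (dB) between a rendered novel view
    # and the held-out ground-truth image.
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

In a typical protocol, PSNR is averaged over all held-out target views for each object and then over all objects.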

Ablation Study
