(a) Overview of SparseAGS: Given estimated camera poses from off-the-shelf models, our method iteratively reconstructs 3D and optimizes poses by leveraging diffusion priors. (b) Detailed View of Each Component: We use a rendering loss and a multi-view SDS loss for 3D reconstruction, while the rendering loss is also back-propagated to refine camera poses. At the end of each reconstruction iteration, we identify outlier views by checking whether their involvement in 3D inference yields larger errors in the other views, which would imply that their poses are inconsistent with the rest.

Multi-view 3D Reconstruction with Diffusion Priors. The 3D representation lies at the core of our approach, and reconstructing 3D from sparse-view images is itself an open research problem. We build our method on DreamGaussian, an efficient image-to-3D method.

DreamGaussian has a few limitations: (1) It only supports a 3-DoF camera parametrization (radius, azimuth, and elevation), which cannot fully represent the 6-DoF cameras found in the real world. This limitation is inherited from the diffusion model Zero-1-to-3 (which adopts a 3-DoF camera), which provides priors over novel views given input images. (2) DreamGaussian takes only one image as input, so it cannot exploit the visual cues from multiple images.
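To make the gap between the two parametrizations concrete, here is a minimal NumPy sketch (our own notation, not code from the paper): a 3-DoF camera is constrained to a look-at pose on a sphere around the object, whereas a 6-DoF pose is an arbitrary rigid transform, from which a relative pose between views can serve as diffusion conditioning.

```python
import numpy as np

def pose_3dof(radius, azimuth, elevation):
    """Look-at camera on a sphere around the origin: only 3 degrees of freedom."""
    # Camera center in world coordinates.
    c = radius * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    backward = c / np.linalg.norm(c)           # camera looks along -z, at the origin
    right = np.cross([0.0, 0.0, 1.0], backward)
    right /= np.linalg.norm(right)
    up = np.cross(backward, right)             # "up" fixed by world-up: no roll
    R = np.stack([right, up, backward])        # world-to-camera rotation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = -R @ c
    return T

def relative_pose_6dof(T_src, T_tgt):
    """General 6-DoF relative pose (a full 4x4 rigid transform), which a
    6-DoF camera conditioning can encode but a 3-DoF one cannot."""
    return T_tgt @ np.linalg.inv(T_src)
```

Any world-to-camera matrix produced by `pose_3dof` is determined by just three numbers; a 6-DoF relative pose has three rotational and three translational degrees of freedom.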
We make a few modifications to DreamGaussian accordingly. First, on top of the vanilla Zero-1-to-3, we replace its camera conditioning module with a novel 6-DoF one and fine-tune the model on real-world images (it was initially trained only on synthetic data). Second, we extend DreamGaussian to support multi-view inputs by introducing a multi-view-aware SDS loss. Lastly, we develop custom CUDA kernels for Gaussian Splatting that make rendering differentiable with respect to camera poses, allowing us to handle noisy cameras. With these modifications, our 3D reconstruction approach works on multiple real-world images with 6-DoF camera poses and can adjust the poses during reconstruction. We maintain the efficiency of DreamGaussian by batching operations across the multi-view images.
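As a schematic illustration of the multi-view-aware SDS loss, the sketch below averages the standard SDS gradient over a batch of views. Here `fake_denoiser` is a placeholder for the fine-tuned 6-DoF diffusion model, and the weighting w(t) = 1 - ᾱ_t is one common choice in the SDS literature, not necessarily the exact one used in our implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_denoiser(noisy_latent, t, cond_pose):
    """Placeholder for the 6-DoF-conditioned diffusion model's noise prediction."""
    # A real model would predict the added noise given the noisy render,
    # the timestep, and the relative-pose conditioning; here we return zeros.
    return np.zeros_like(noisy_latent)

def multiview_sds_grad(rendered_latents, cond_poses, t, alphas_cumprod):
    """SDS gradient averaged over a batch of views (schematic).

    rendered_latents: (V, C, H, W) latents rendered from the current 3D model.
    cond_poses: per-view relative-pose conditioning for the diffusion model.
    """
    a_t = alphas_cumprod[t]
    w_t = 1.0 - a_t                            # one common SDS weighting w(t)
    grads = []
    for z, pose in zip(rendered_latents, cond_poses):
        eps = rng.standard_normal(z.shape)     # freshly sampled noise per view
        z_noisy = np.sqrt(a_t) * z + np.sqrt(1.0 - a_t) * eps
        eps_pred = fake_denoiser(z_noisy, t, pose)
        grads.append(w_t * (eps_pred - eps))   # per-view SDS gradient w.r.t. z
    # Averaging over views yields one batched update per iteration, mirroring
    # how multi-view operations are batched for efficiency.
    return np.mean(grads, axis=0)
```

In a real pipeline this gradient would be back-propagated through the differentiable renderer into the 3D Gaussians, and through the rendering loss into the camera poses.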
In the following, we show that our 6-DoF Zero-1-to-3 can manipulate in-plane camera rotations, which the vanilla Zero-1-to-3 cannot model.

We also show that our 6-DoF camera parametrization is more expressive than the 3-DoF one of the vanilla Zero-1-to-3.
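The expressiveness gap can be checked numerically: a 3-DoF look-at camera always has zero roll (its x-axis stays horizontal in the world), whereas a 6-DoF rotation can include an arbitrary in-plane component. A small sketch under our own conventions, not code from the paper:

```python
import numpy as np

def lookat_rotation(azimuth, elevation):
    """World-to-camera rotation of a 3-DoF look-at camera (up tied to world z)."""
    backward = np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    right = np.cross([0.0, 0.0, 1.0], backward)
    right /= np.linalg.norm(right)
    up = np.cross(backward, right)
    return np.stack([right, up, backward])

def roll_angle(R):
    """In-plane rotation: angle between the camera x-axis and the horizon."""
    right = R[0]                   # camera x-axis expressed in world coordinates
    return np.arcsin(right[2])     # zero iff "right" is horizontal (no roll)

R = lookat_rotation(0.7, 0.3)
assert abs(roll_angle(R)) < 1e-8   # any 3-DoF camera: roll is exactly zero

# A 6-DoF camera can roll: compose with an in-plane rotation about the optical axis.
theta = 0.5
R_inplane = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                      [np.sin(theta),  np.cos(theta), 0.0],
                      [0.0, 0.0, 1.0]])
R_rolled = R_inplane @ R           # valid rotation, unreachable by any (az, el)
```

No choice of azimuth and elevation can reproduce `R_rolled`, which is exactly the in-plane rotation that the 6-DoF conditioning adds.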
