Related Work


The task of 3D reconstruction from sparse views presents significant challenges that many existing methodologies struggle to address effectively. Techniques such as Neural Radiance Fields (NeRF) [1] and 3D Gaussian Splatting (3DGS) [2] typically require dense image sets, often tens to hundreds of views, along with precise camera poses, which makes data capture impractical for everyday applications. They also rely on computationally intensive per-scene optimization, which is time-consuming and difficult to apply when only a few views are available and the camera poses may be imprecise.

Several innovative approaches have been developed to tackle these issues:

For instance, the Scene Representation Transformer (SRT) [3] processes a few posed or unposed images of a novel real-world scene and constructs a latent scene representation using semantic priors learnt implicitly by a CNN-plus-Transformer encoder. At inference, the input images are passed through the encoder in a single forward pass to obtain a set-latent scene representation, which the learnt decoder can then query to render any novel view. However, the method is entirely implicit, making it difficult to extract an explicit representation for downstream tasks, and it has been shown to have an imprecise 3D understanding, often producing blurry or suboptimal renderings.
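To make the encode-once, query-per-ray pattern concrete, the sketch below shows an SRT-style pipeline in PyTorch: a CNN-plus-Transformer encoder pools patch tokens from all input views into a set latent, and a decoder cross-attends ray queries against it. Module sizes, layer counts, and names are illustrative placeholders, not SRT's actual architecture.

```python
# Minimal sketch of the SRT-style "encode once, query any ray" pattern.
# Shapes and module sizes are illustrative placeholders, not SRT's real config.
import torch
import torch.nn as nn

class SetLatentEncoder(nn.Module):
    """CNN feature extractor followed by a Transformer over all patch tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                          # downsample each image 8x
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, images):                             # images: [B, V, 3, H, W]
        B, V, _, _, _ = images.shape
        feats = self.cnn(images.flatten(0, 1))             # [B*V, dim, h, w]
        tokens = feats.flatten(2).transpose(1, 2)          # [B*V, h*w, dim]
        tokens = tokens.reshape(B, -1, tokens.shape[-1])   # pool tokens from all views
        return self.transformer(tokens)                    # set-latent scene representation

class RayDecoder(nn.Module):
    """Cross-attends a ray query against the set latent and predicts a colour."""
    def __init__(self, dim=256):
        super().__init__()
        self.ray_embed = nn.Linear(6, dim)                 # (origin, direction) -> token
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.rgb = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, rays, set_latent):                   # rays: [B, R, 6]
        q = self.ray_embed(rays)
        out, _ = self.attn(q, set_latent, set_latent)      # one cross-attention pass
        return self.rgb(out)                               # [B, R, 3] colours

# Single forward pass at inference: encode the (possibly unposed) views once,
# then render any number of novel rays from the latent.
images = torch.randn(1, 3, 3, 64, 64)                      # 3 input views
rays = torch.randn(1, 1024, 6)                             # novel-view ray queries
latent = SetLatentEncoder()(images)
colors = RayDecoder()(rays, latent)                        # [1, 1024, 3]
```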

PixelSplat [4] operates on just a pair of images together with their camera metadata. It predicts a predefined number of 3D Gaussians for each pixel of the input views and renders these Gaussians from new perspectives. The images are passed through a pre-trained feature encoder, depth information is incorporated via an epipolar attention layer, and subsequent layers refine the result into feature maps from which a neural network predicts the Gaussian parameters. However, the method is limited to the two-view setting, requires the relative camera pose as input, needs sufficient overlap between the input views for the epipolar constraints to be effective, and struggles with unseen regions because the pixel-aligned Gaussians only cover what the input views observe.
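The sketch below illustrates the pixel-aligned prediction idea in PyTorch: a 1x1 convolution maps each pixel's feature vector to Gaussian parameters, and the mean is obtained by back-projecting the pixel along its camera ray to a predicted depth. It is a simplified stand-in rather than PixelSplat's actual head, which predicts a probabilistic depth distribution and multiple Gaussians per pixel; the shapes and parameterization here are assumptions.

```python
# Illustrative pixel-aligned Gaussian head: every pixel of a feature map becomes
# one 3D Gaussian whose mean lies on that pixel's camera ray. Simplified sketch,
# not pixelSplat's exact parameterisation.
import torch
import torch.nn as nn

class PixelAlignedGaussianHead(nn.Module):
    def __init__(self, feat_dim=64, near=0.5, far=10.0):
        super().__init__()
        self.near, self.far = near, far
        # per-pixel outputs: depth(1) + opacity(1) + scale(3) + quaternion(4) + rgb(3)
        self.head = nn.Conv2d(feat_dim, 12, kernel_size=1)

    def forward(self, feats, K_inv):
        """feats: [B, C, H, W] image features, K_inv: [B, 3, 3] inverse intrinsics."""
        B, _, H, W = feats.shape
        out = self.head(feats)
        depth = self.near + (self.far - self.near) * torch.sigmoid(out[:, 0:1])
        opacity = torch.sigmoid(out[:, 1:2])
        scales = torch.exp(out[:, 2:5])                     # positive anisotropic scales
        quats = nn.functional.normalize(out[:, 5:9], dim=1)
        rgb = torch.sigmoid(out[:, 9:12])

        # Back-project every pixel centre along its ray to the predicted depth.
        ys, xs = torch.meshgrid(
            torch.arange(H, dtype=feats.dtype),
            torch.arange(W, dtype=feats.dtype),
            indexing="ij",
        )
        pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=0)  # [3, H, W]
        rays = torch.einsum("bij,jhw->bihw", K_inv, pix.to(feats.device))
        means = rays * depth                                # [B, 3, H, W] camera-space centres

        flat = lambda t: t.flatten(2).transpose(1, 2)       # -> [B, H*W, C]
        return {
            "means": flat(means), "opacity": flat(opacity),
            "scales": flat(scales), "quats": flat(quats), "rgb": flat(rgb),
        }

# One Gaussian per pixel: a 64x64 feature map yields 4096 primitives for this view.
head = PixelAlignedGaussianHead()
gaussians = head(torch.randn(1, 64, 64, 64), torch.eye(3).unsqueeze(0))
```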

MVSplat [5] introduces a cost-volume-based approach to sparse 3D reconstruction inspired by stereo techniques. It extracts multi-view features using a Transformer with self- and cross-attention layers for inter-view information exchange. Per-view cost volumes are constructed via plane sweeping and refined alongside the Transformer features with a 2D U-Net that uses cross-view attention, yielding per-view depth maps. These depth maps are unprojected to 3D and combined through a deterministic union operation to form the 3D Gaussian centers, with opacity, covariance, and color parameters predicted jointly. This method also requires camera poses as input and is constrained to pixel-aligned predictions, so it suffers from drawbacks similar to PixelSplat's.
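The following sketch shows the core cost-volume-to-depth-to-centres step in PyTorch. The plane-sweep warp is omitted; `warped_src` stands in for source-view features already warped onto the reference view at each depth candidate, and the shapes and helper names are illustrative rather than MVSplat's implementation.

```python
# Simplified cost-volume-to-Gaussian-centres step in the spirit of MVSplat.
# The plane-sweep warp that aligns the other view's features to each depth
# candidate is omitted here; `warped_src` is assumed to be its output.
import torch
import torch.nn.functional as F

def depth_from_cost_volume(ref_feats, warped_src, depth_candidates):
    """ref_feats: [B, C, H, W]; warped_src: [B, D, C, H, W] source features
    warped onto the reference view at each of D depth planes;
    depth_candidates: [D] plane depths."""
    B, C, H, W = ref_feats.shape
    # Correlation cost volume: similarity of reference and warped features per depth.
    cost = (ref_feats.unsqueeze(1) * warped_src).sum(dim=2) / C**0.5   # [B, D, H, W]
    prob = F.softmax(cost, dim=1)                                      # depth distribution
    depth = (prob * depth_candidates.view(1, -1, 1, 1)).sum(dim=1)     # [B, H, W] expectation
    return depth

def unproject_to_centres(depth, K_inv):
    """Lift each pixel of a depth map to a camera-space Gaussian centre."""
    B, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                            torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=0)  # [3, H, W]
    rays = torch.einsum("bij,jhw->bihw", K_inv, pix)                     # per-pixel rays
    return rays * depth.unsqueeze(1)                                     # [B, 3, H, W]

# Toy example: 32 depth planes between 0.5 and 10 units.
B, C, H, W, D = 1, 64, 32, 32, 32
depths = torch.linspace(0.5, 10.0, D)
depth_map = depth_from_cost_volume(torch.randn(B, C, H, W),
                                   torch.randn(B, D, C, H, W), depths)
centres = unproject_to_centres(depth_map, torch.eye(3).unsqueeze(0))     # Gaussian means
```

MVSplat then takes the union of the centres from all views and predicts the remaining opacity, covariance, and color parameters jointly with the depth.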

Finally, Splatt3R [6] builds on top of MASt3R and shows that a simple architectural modification and a well-chosen training loss can yield strong novel view synthesis results. Each image is encoded with a vision transformer and passed to a decoder that performs cross-attention across images. While MASt3R has two prediction heads, for 3D points and feature matching, Splatt3R introduces a third head that predicts covariances, spherical harmonics, opacities, and mean offsets, yielding a complete Gaussian primitive for each pixel. During training, only the Gaussian prediction head is optimized; a pre-trained MASt3R model supplies the other parameters. Since MASt3R, like the DUSt3R model it builds on, does not require camera poses as input, this method is likewise pose-free. However, the pixel-aligned predictions once again prevent the pipeline from accurately modeling unseen regions.
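This training setup can be summarised schematically as freezing a pretrained two-head backbone and optimising only a newly added Gaussian head. In the PyTorch sketch below, the backbone is a placeholder module rather than the real MASt3R network, and the head's output channels follow the parameter list described above; all names and dimensions are assumptions.

```python
# Schematic of the Splatt3R-style training setup: a pretrained two-head
# backbone (a stand-in for MASt3R here, not its real API) is frozen, and only
# a newly added Gaussian head is optimised.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Predicts the remaining per-pixel Gaussian parameters from decoder features."""
    def __init__(self, feat_dim=256, sh_degree=1):
        super().__init__()
        sh_coeffs = 3 * (sh_degree + 1) ** 2
        # scale(3) + quaternion(4) + opacity(1) + mean offset(3) + spherical harmonics
        self.proj = nn.Conv2d(feat_dim, 3 + 4 + 1 + 3 + sh_coeffs, kernel_size=1)

    def forward(self, feats):                      # feats: [B, C, H, W]
        out = self.proj(feats)
        return {
            "scales": torch.exp(out[:, 0:3]),
            "quats": nn.functional.normalize(out[:, 3:7], dim=1),
            "opacity": torch.sigmoid(out[:, 7:8]),
            "offsets": out[:, 8:11],               # added to the backbone's 3D points
            "sh": out[:, 11:],
        }

backbone = nn.Conv2d(3, 256, 3, padding=1)         # placeholder for the frozen MASt3R trunk
for p in backbone.parameters():                    # pretrained weights stay fixed
    p.requires_grad_(False)

head = GaussianHead()
optimiser = torch.optim.Adam(head.parameters(), lr=1e-4)   # only the new head is trained

images = torch.randn(2, 3, 64, 64)                 # an uncalibrated image pair
with torch.no_grad():
    feats = backbone(images)                       # 3D points / matches come from here too
gaussian_params = head(feats)                      # completes one primitive per pixel
```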

Overall, many works have explored the challenging problem of sparse, generalizable 3D reconstruction, but almost all of them either require camera pose information or produce per-pixel predictions for the input views. The latter severely restricts the density of the reconstruction and makes these methods prone to failure in the unseen regions of objects and scenes. Our method aims to address these problems while remaining competitive on the sparse-view reconstruction task.

References

  1. Mildenhall, Ben, et al. “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” Communications of the ACM 65.1 (2021): 99-106.
  2. Kerbl, Bernhard, et al. “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” ACM Transactions on Graphics (TOG) 42.4 (2023): 1-14.
  3. Sajjadi, Mehdi S. M., et al. “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations.” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 6219-6228.
  4. Charatan, David, et al. “pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction.” arXiv preprint arXiv:2312.12337 (2023).
  5. Chen, Yuedong, et al. “MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images.” arXiv preprint arXiv:2403.14627 (2024).
  6. Smart, Brandon, et al. “Splatt3R: Zero-Shot Gaussian Splatting from Uncalibrated Image Pairs.” arXiv preprint arXiv:2408.13912 (2024).
  7. Szymanowicz, Stanislaw, et al. “Splatter Image: Ultra-Fast Single-View 3D Reconstruction.” arXiv preprint arXiv:2312.13150 (2023).
  8. Yu, Alex, et al. “pixelNeRF: Neural Radiance Fields from One or Few Images.” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 4576-4585.
  9. Kani, Bharath Raj Nagoor, et al. “UpFusion: Novel View Diffusion from Unposed Sparse View Observations.” arXiv preprint arXiv:2312.06661 (2023).