Related Work


The task of 3D reconstruction from sparse views presents significant challenges that many existing methodologies struggle to address effectively. Techniques such as Neural Radiance Fields (NeRF) [1] and 3D Gaussian Splatting (3DGS) [2] typically require dense image sets—ranging from tens to hundreds—along with precise camera pose information. This requirement often renders the data capture process impractical for everyday applications. Additionally, these methods involve computationally intensive per-scene optimizations, which are not only time-consuming but also difficult to apply in scenarios with limited view availability where camera poses may be imprecise.

Several innovative approaches have been developed to tackle these issues.

For instance, the Scene Representation Transformer (SRT) [3] processes a handful of posed or unposed images of a novel real-world scene with a CNN-plus-Transformer model, constructing a set-latent scene representation in which geometry and semantics are learned implicitly and from which novel views can be decoded in real time. At inference, the input images are passed through the model in a single forward pass to obtain this set-latent representation, which can then be queried for any novel view.
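
As a rough illustration of this flow, the sketch below (PyTorch, with hypothetical module names such as SetLatentEncoder and RayDecoder and toy sizes, not the authors' implementation) encodes a few images once into a set of latent tokens and then lets novel-view rays cross-attend to those tokens to decode colors.

```python
# Minimal, illustrative SRT-style sketch; module names and sizes are placeholders.
import torch
import torch.nn as nn


class SetLatentEncoder(nn.Module):
    """Encodes N input images into an unordered set of latent tokens."""

    def __init__(self, dim=256):
        super().__init__()
        # Tiny CNN stand-in for the convolutional backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.ReLU(),
            nn.Conv2d(64, dim, 4, stride=4), nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, images):                           # images: (B, N, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.cnn(images.flatten(0, 1))           # (B*N, dim, h', w')
        tokens = feats.flatten(2).transpose(1, 2)        # (B*N, h'*w', dim)
        tokens = tokens.reshape(b, -1, tokens.shape[-1]) # pool tokens from all views
        return self.transformer(tokens)                  # set-latent scene representation


class RayDecoder(nn.Module):
    """Queries the set-latent representation with rays to predict their colors."""

    def __init__(self, dim=256):
        super().__init__()
        self.ray_embed = nn.Linear(6, dim)               # ray origin (3) + direction (3)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, latents, rays):                    # rays: (B, R, 6)
        q = self.ray_embed(rays)
        out, _ = self.attn(q, latents, latents)          # cross-attend rays -> scene tokens
        return torch.sigmoid(self.to_rgb(out))           # (B, R, 3) colors


# Single forward pass over the inputs, then arbitrary novel-view queries.
images = torch.randn(1, 3, 3, 64, 64)                    # 3 input views
rays = torch.randn(1, 1024, 6)                           # rays of a novel view
latents = SetLatentEncoder()(images)
colors = RayDecoder()(latents, rays)
print(colors.shape)                                      # torch.Size([1, 1024, 3])
```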

pixelSplat [4] takes a different approach, using just a pair of images together with their camera parameters. It predicts a fixed number of 3D Gaussians for each pixel of the input views and renders these Gaussians from new perspectives. The images are passed through a pre-trained feature encoder, cross-view and depth cues are incorporated via an epipolar attention layer, and subsequent layers refine the result into per-view feature maps. A neural network then predicts the Gaussian parameters from these maps.
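
A rough sketch of such a per-pixel prediction head is shown below; the names, channel layout, and depth-bin parameterization are illustrative assumptions rather than pixelSplat's actual implementation. Given encoder features for one view, it predicts, for every pixel, a depth distribution together with opacity, scale, rotation, and color for each of K Gaussians.

```python
# Hypothetical per-pixel Gaussian head in the spirit of pixelSplat (not the authors' code).
# Assumes per-view feature maps were already produced by an encoder with epipolar attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 3          # Gaussians predicted per pixel
D_BINS = 32    # discrete depth bins for the per-pixel depth distribution


class GaussianHead(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # Per Gaussian: depth logits (D_BINS), opacity (1), scale (3), quaternion (4), RGB (3)
        out_per_gaussian = D_BINS + 1 + 3 + 4 + 3
        self.head = nn.Conv2d(feat_dim, K * out_per_gaussian, kernel_size=1)

    def forward(self, feats):                              # feats: (B, C, H, W)
        b, _, h, w = feats.shape
        x = self.head(feats).view(b, K, -1, h, w)
        depth_logits, opacity, scale, rot, rgb = torch.split(
            x, [D_BINS, 1, 3, 4, 3], dim=2)
        depth_probs = F.softmax(depth_logits, dim=2)       # probabilistic depth per pixel
        # Expected depth of each Gaussian along its pixel ray.
        bins = torch.linspace(0.5, 20.0, D_BINS).view(1, 1, D_BINS, 1, 1)
        depth = (depth_probs * bins).sum(dim=2, keepdim=True)
        return {
            "depth": depth,                                # (B, K, 1, H, W)
            "opacity": torch.sigmoid(opacity),
            "scale": F.softplus(scale),
            "rotation": F.normalize(rot, dim=2),           # unit quaternion
            "rgb": torch.sigmoid(rgb),
        }


feats = torch.randn(1, 128, 32, 32)                        # features for one input view
gaussians = GaussianHead()(feats)
print(gaussians["depth"].shape, gaussians["rgb"].shape)
```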

Splatter Image [5] introduces a more streamlined approach, learning a single 3D Gaussian per pixel. It uses an image-to-image neural network to predict each pixel's Gaussian parameters: opacity, color, mean, and covariance. The method also extends to dual-view inputs: a splatter image is constructed for each view, and the two sets of Gaussians are brought into a common coordinate frame and merged, demonstrating the method's adaptability to sparse input data.
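
The sketch below illustrates this idea under simplifying assumptions (a stand-in convolutional predictor, a simplified pinhole unprojection, and hypothetical names; not the official Splatter Image code): one network call per view produces a splatter image, and the resulting Gaussian centers are expressed in a shared frame and concatenated as the union of both views' Gaussians.

```python
# Illustrative Splatter-Image-style sketch (hypothetical names, not the official code).
import torch
import torch.nn as nn

# Channels per pixel: depth (1) + xyz offset (3) + opacity (1) + scale (3) + quaternion (4) + RGB (3)
GAUSSIAN_CHANNELS = 1 + 3 + 1 + 3 + 4 + 3


class SplatterNet(nn.Module):
    """Stand-in for the image-to-image (U-Net style) predictor."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, GAUSSIAN_CHANNELS, 3, padding=1),
        )

    def forward(self, image):                  # (B, 3, H, W) -> (B, 15, H, W)
        return self.net(image)


def to_world_centers(splat, cam_to_world):
    """Turn a splatter image into per-pixel Gaussian centers in world space.

    Uses only the depth channel and a simplified pinhole ray grid; cam_to_world
    is a 4x4 camera pose (assumed known).
    """
    b, c, h, w = splat.shape
    depth = splat[:, :1]                                        # (B, 1, H, W)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    dirs = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)    # (3, H, W) pixel rays
    centers_cam = depth * dirs                                  # (B, 3, H, W)
    centers = torch.einsum("ij,bjhw->bihw", cam_to_world[:3, :3], centers_cam) \
              + cam_to_world[:3, 3].view(1, 3, 1, 1)
    return centers.flatten(2).transpose(1, 2)                   # (B, H*W, 3)


net = SplatterNet()
view_a, view_b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
pose_a, pose_b = torch.eye(4), torch.eye(4)                     # known camera poses
centers_a = to_world_centers(net(view_a), pose_a)
centers_b = to_world_centers(net(view_b), pose_b)
all_centers = torch.cat([centers_a, centers_b], dim=1)          # union of both views' Gaussians
print(all_centers.shape)                                        # torch.Size([1, 8192, 3])
```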

References

  1. Mildenhall, Ben, et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." Communications of the ACM 65.1 (2021): 99-106.
  2. Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics (TOG) 42 (2023): 1-14.
  3. Sajjadi, Mehdi S. M., et al. "Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations." 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 6219-6228.
  4. Charatan, David, et al. "pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction." arXiv preprint arXiv:2312.12337 (2023).
  5. Szymanowicz, Stanislaw, et al. "Splatter Image: Ultra-Fast Single-View 3D Reconstruction." arXiv preprint arXiv:2312.13150 (2023).
  6. Yu, Alex, et al. "pixelNeRF: Neural Radiance Fields from One or Few Images." 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 4576-4585.
  7. Kani, Bharath Raj Nagoor, et al. "UpFusion: Novel View Diffusion from Unposed Sparse View Observations." arXiv preprint arXiv:2312.06661 (2023).