Gaussian Splatting for Unconstrained Photo Collections (Splatfacto-W)

We adopt Splatfacto-W, a variant of Gaussian Splatting specifically designed for unconstrained image and video collections captured in natural settings.

[Figure: a simple 3D GS scene vs. the Splatfacto-W variant]

Background: Original Gaussian Splatting

In Gaussian Splatting, a 3D scene is represented by a collection of 3D Gaussian primitives, where each Gaussian models a small region of space with its position (μ), covariance (Σ), opacity (α), and color (c). Unlike traditional point clouds or meshes, these Gaussians are continuous and differentiable, making them well-suited for optimization via gradient-based methods.
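To make this representation concrete, here is a minimal parameterization sketch in PyTorch; the field names and the rotation-plus-scale factorization of Σ follow common 3DGS practice, but are assumptions here rather than any specific implementation:

```python
import torch
import torch.nn as nn

class GaussianScene(nn.Module):
    """N 3D Gaussians, each with position μ, covariance Σ (factored as
    rotation + per-axis scale), opacity α, and color c. Every field is a
    differentiable parameter, so the whole scene can be optimized by
    gradient descent."""

    def __init__(self, num_gaussians):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_gaussians, 3))        # μ
        self.log_scales = nn.Parameter(torch.zeros(num_gaussians, 3))   # per-axis extent of Σ
        self.quats = nn.Parameter(torch.randn(num_gaussians, 4))        # rotation of Σ
        self.opacity_logits = nn.Parameter(torch.zeros(num_gaussians))  # α before sigmoid
        self.colors = nn.Parameter(torch.rand(num_gaussians, 3))        # c (base RGB; the full model uses spherical harmonics)
```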

3D Covariance and Projection

Each Gaussian’s 3D covariance matrix (Σ) defines its shape and spatial extent, modeling how much influence it has on the surrounding region. During rendering, this 3D covariance is projected into 2D image space using a view-dependent transformation:

$$\Sigma' = J W \Sigma W^\top J^\top$$

Here, W is the world-to-camera transformation, and J is the Jacobian of the projection, which approximates how the 3D Gaussian deforms when projected into 2D space.
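As a sketch of this projection step, assuming a pinhole camera and the usual first-order linearization of the perspective projection; the function and variable names are illustrative, not from any particular implementation:

```python
import numpy as np

def project_covariance(sigma_3d, R_wc, t_wc, mean_world, fx, fy):
    """Project a 3D Gaussian covariance into 2D image space.

    sigma_3d:   (3, 3) world-space covariance Σ
    R_wc, t_wc: world-to-camera rotation (3, 3) and translation (3,)
    mean_world: (3,) Gaussian center μ in world coordinates
    fx, fy:     focal lengths in pixels
    """
    # Gaussian center in camera coordinates
    x, y, z = R_wc @ mean_world + t_wc

    # Jacobian J of the perspective projection, linearized at the center
    J = np.array([
        [fx / z, 0.0,    -fx * x / z**2],
        [0.0,    fy / z, -fy * y / z**2],
    ])

    # Σ' = J W Σ Wᵀ Jᵀ, with W the world-to-camera rotation
    JW = J @ R_wc
    return JW @ sigma_3d @ JW.T   # (2, 2) image-space covariance
```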

Color Representation and α-Blending

Each Gaussian carries a color (c) represented via third-order spherical harmonics, allowing it to capture view-dependent appearance (e.g., specularities and shading). The influence of a Gaussian on a pixel is computed using a 2D Gaussian function:

$$\sigma_i = \alpha_i \exp\!\left(-\tfrac{1}{2}(r - \mu_i')^\top \Sigma_i'^{-1} (r - \mu_i')\right)$$

where σ_i is the contribution of the i-th Gaussian to the pixel at location r, μ_i' and Σ_i' are its projected 2D mean and covariance, and α_i is its opacity.
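In code, evaluating one Gaussian's contribution at a pixel is a two-dimensional Mahalanobis-distance test; a small NumPy sketch (names are illustrative):

```python
import numpy as np

def gaussian_contribution(r, mean_2d, cov_2d, opacity):
    """Evaluate σ_i: the i-th Gaussian's contribution at pixel location r.

    r, mean_2d: (2,) pixel location and projected center μ'
    cov_2d:     (2, 2) projected covariance Σ'
    opacity:    scalar α in [0, 1]
    """
    d = r - mean_2d                      # offset from the projected center
    m = d @ np.linalg.inv(cov_2d) @ d    # squared Mahalanobis distance
    return opacity * np.exp(-0.5 * m)
```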

The final pixel color is computed using alpha blending:

$$C(r) = \sum_{i \in G_r} c_i\, \sigma_i \prod_{j=1}^{i-1} (1 - \sigma_j)$$

Here, r is the position of a pixel, and G_r denotes the set of Gaussians overlapping that pixel, sorted front to back by depth.

This blending process ensures smooth compositing and natural depth-aware rendering.
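The blending formula above translates directly into a front-to-back loop over the sorted Gaussians; a toy sketch for one pixel, assuming the c_i and σ_i have already been evaluated:

```python
import numpy as np

def composite_pixel(colors, sigmas):
    """Alpha-blend depth-sorted Gaussians at one pixel.

    colors: (N, 3) RGB colors c_i, sorted front to back
    sigmas: (N,)   per-Gaussian contributions σ_i in [0, 1]
    """
    pixel = np.zeros(3)
    transmittance = 1.0  # fraction of light not yet absorbed
    for c, s in zip(colors, sigmas):
        pixel += transmittance * s * c
        transmittance *= (1.0 - s)
        if transmittance < 1e-4:  # early termination, as in 3DGS rasterizers
            break
    return pixel, 1.0 - transmittance  # composited color and accumulated alpha
```

Sorting front to back and tracking transmittance lets the loop stop early once the pixel is effectively opaque, which is part of what makes tile-based rasterization fast.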

GS in the Wild

The ‘GS in the wild’ approach, realized here as Splatfacto-W, introduces three key innovations to handle real-world variability:

  1. Neural Color Fields
    Each 3D Gaussian is assigned a learned color feature, which is decoded via a small MLP conditioned on the view direction and local appearance, allowing for view-dependent color synthesis.
  2. Per-Image Appearance Embeddings
    A latent embedding is learned for each image to model global lighting, exposure, and sensor variation, decoupling scene geometry from photometric conditions (a combined sketch of innovations 1 and 2 follows this list).
  3. Spherical Harmonics Background Model
    To handle complex outdoor backgrounds (e.g., sky, trees, distant objects), a low-frequency spherical harmonics field is used to model background appearance separately from foreground geometry.
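A minimal sketch of how innovations 1 and 2 might fit together, in PyTorch; the module structure, feature sizes, and conditioning scheme are illustrative assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

class AppearanceModel(nn.Module):
    """Decode per-Gaussian colors from a learned color feature, the view
    direction, and a per-image appearance embedding (illustrative sizes)."""

    def __init__(self, num_images, feat_dim=32, embed_dim=16):
        super().__init__()
        # One latent appearance embedding per training image
        self.image_embeddings = nn.Embedding(num_images, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3 + embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 3),
            nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, gaussian_feats, view_dirs, image_idx):
        # gaussian_feats: (N, feat_dim) learned color feature per Gaussian
        # view_dirs:      (N, 3) unit vectors from camera to each Gaussian
        # image_idx:      0-dim long tensor indexing the image being rendered
        embed = self.image_embeddings(image_idx).expand(gaussian_feats.shape[0], -1)
        return self.mlp(torch.cat([gaussian_feats, view_dirs, embed], dim=-1))
```

Conditioning on a per-image embedding lets the same geometry explain photos taken under different lighting and exposure: between images, only the embedding changes, not the Gaussians themselves.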

Final Rendering Pipeline

Starting from the optimized 3D Gaussian scene, the variant predicts the view-dependent color of each Gaussian with the appearance model. These Gaussians are then rasterized to render the foreground of the scene. In parallel, the background model predicts background appearance from ray directions alone, using a low-frequency spherical harmonics representation. Finally, foreground and background are composited via alpha blending to produce the final rendered image.
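Putting the pieces together, a high-level sketch of this pipeline; `rasterize` is a hypothetical stand-in for a tile-based Gaussian rasterizer returning premultiplied RGB and accumulated alpha, and the degree-1 spherical-harmonics background shown here is a simplification of the low-frequency field described above:

```python
import torch
import torch.nn.functional as F

def render_image(gaussians, camera, appearance_model, bg_sh, image_idx):
    # 1. View-dependent foreground colors from the appearance model
    #    (gaussians.features: per-Gaussian learned color features, innovation 1)
    view_dirs = F.normalize(gaussians.means - camera.position, dim=-1)
    colors = appearance_model(gaussians.features, view_dirs, image_idx)

    # 2. Rasterize foreground Gaussians: (H, W, 3) RGB and (H, W, 1) alpha
    #    (`rasterize` is a placeholder for the tile-based rasterizer)
    fg_rgb, fg_alpha = rasterize(gaussians, colors, camera)

    # 3. Background from ray directions via low-order spherical harmonics;
    #    bg_sh is a (4, 3) coefficient tensor for a degree-1 basis
    d = camera.ray_directions()                       # (H, W, 3) unit rays
    basis = torch.stack([torch.ones_like(d[..., 0]),  # constant term
                         d[..., 1], d[..., 2], d[..., 0]], dim=-1)
    bg_rgb = basis @ bg_sh                            # (H, W, 3)

    # 4. Composite foreground over background with alpha blending
    return fg_rgb + (1.0 - fg_alpha) * bg_rgb
```

Because the background depends only on ray direction, it costs a single basis evaluation per pixel and never competes with the foreground Gaussians for geometric detail.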