Overview

Dataset

We evaluate our method on 8 real-world equirectangular video walks captured using an Insta360 camera mounted on a handheld tripod. These sequences represent natural outdoor motion with varying lighting conditions and scene geometry.

Evaluation Setup

Gaussian Splatting (GS) reconstructions are evaluated on a held-out split, with 10% of images reserved for evaluation. All pipelines are trained and evaluated using the same splits to ensure fair comparison.
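The split described above can be sketched as a small deterministic helper. This is a hypothetical illustration (the function name, seed, and sorting convention are assumptions, not taken from the actual pipeline); the key property is that a fixed seed over a canonical ordering yields the identical split for every pipeline.

```python
import random

def make_eval_split(image_names, eval_fraction=0.1, seed=0):
    """Reserve a fixed fraction of frames for evaluation.

    Hypothetical helper: sorting gives a canonical order independent of
    how the frames were listed, and a fixed seed makes the split
    reproducible, so all pipelines train and evaluate on the same frames.
    """
    names = sorted(image_names)
    rng = random.Random(seed)
    n_eval = max(1, round(len(names) * eval_fraction))
    eval_set = set(rng.sample(names, n_eval))
    train = [n for n in names if n not in eval_set]
    evals = [n for n in names if n in eval_set]
    return train, evals
```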

Pipelines Compared

  • Stella-VSLAM (baseline)
  • Rig-based SfM
  • Rig-based SfM with depth priors

Metrics

Reconstruction quality is evaluated using standard image-based metrics:

  • PSNR
  • SSIM
  • LPIPS
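For reference, the two pixel-based metrics can be computed directly from image arrays. The sketch below is a minimal NumPy version: PSNR follows the standard definition, while the SSIM shown is the simplified single-window (global-statistics) variant; library implementations such as scikit-image use a sliding Gaussian window, and LPIPS requires a pretrained perceptual network, so it is omitted here.

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio for images with values in [0, max_val]."""
    mse = np.mean((np.asarray(ref, dtype=np.float64) -
                   np.asarray(test, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(ref, test, max_val=1.0):
    """Single-window SSIM over global image statistics (simplified variant)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = ref.mean(), test.mean()
    var_x, var_y = ref.var(), test.var()
    cov = ((ref - mu_x) * (test - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```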

Final Metrics

Pipeline                        LPIPS     PSNR     SSIM
Stella-VSLAM                    0.4844    19.79    0.6909
Rig-based SfM                   0.3948    22.48    0.7465
Rig-based SfM + depth priors    0.3931    22.65    0.7469

Detailed Description

Dataset Collection

We initially explored using existing 360° datasets such as KITTI-360 and Matterport3D. However, neither dataset was suitable for our pipeline:

  • Matterport3D does not provide the temporal image sequences required for our optimization pipeline.
  • KITTI-360 is designed for multi-camera fusion; stitching its dual fisheye cameras introduces missing regions above the vehicle due to camera placement.

As a result, we collected our own dataset using an Insta360 X4 camera. Videos were recorded at 5.7K and 8K resolutions during outdoor walking trajectories. Frames are temporally sampled from the videos and passed into the pipeline. Based on empirical analysis, we sample one frame per second, aligned with keyframes produced by the baseline Stella-VSLAM system.
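The one-frame-per-second sampling can be sketched as an index-selection step. This is a hypothetical helper (the name and interface are assumptions): given the source frame rate, it returns which frame indices to keep; in the actual pipeline the kept frames are additionally aligned with Stella-VSLAM keyframes, which this sketch does not model.

```python
def fps_sample_indices(num_frames, video_fps, sample_hz=1.0):
    """Indices of frames to keep when subsampling a video_fps stream
    at sample_hz (e.g. one frame per second).

    Accumulates a fractional position and rounds, so non-integer steps
    (e.g. 29.97 fps) stay aligned over long sequences.
    """
    step = video_fps / sample_hz  # source frames between consecutive samples
    indices, t = [], 0.0
    while round(t) < num_frames:
        indices.append(round(t))
        t += step
    return indices
```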

The final dataset consists of 8 video walks, each 3–8 minutes long, with trajectories recorded both with and without loop closures. This enables evaluation of global bundle adjustment behavior across all three pipelines.

Evaluation Protocol

Baseline: Stella-VSLAM

We directly use camera poses output by Stella-VSLAM to train Gaussian Splatting and evaluate reconstruction quality on the held-out test split.
To account for exposure and lighting variations, image-embedding-based appearance modeling is enabled during GS training.

Rig-based SfM (No Depth Priors)

The same video sequences are processed using our rig-based SfM pipeline, running bundle adjustment only, without scale-aligned monocular depth estimation.
Reconstruction quality is evaluated using the same metrics.

Full Pipeline (End-to-End)

We run the complete system with all preprocessing and optimization components enabled:

  • Rig-based bundle adjustment
  • Scale-aligned monocular depth estimation
  • GS training enhancements (image embeddings and depth priors)

Final reconstruction quality is evaluated on the same held-out dataset.
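The scale alignment of monocular depth can be sketched as a least-squares fit. This is a minimal sketch of one common recipe, not necessarily the exact method used in the pipeline: solve min over (s, t) of ||s·d_mono + t − d_sfm||² on pixels where sparse SfM depth is available, then apply the recovered scale and shift to the full monocular depth map.

```python
import numpy as np

def align_depth(mono_depth, sfm_depth, mask=None):
    """Least-squares scale/shift alignment of monocular depth to sparse
    SfM depth (hypothetical helper; assumes depths, not disparities).

    Solves min_{s,t} || s * d_mono + t - d_sfm ||^2 over valid pixels.
    """
    d = np.asarray(mono_depth, dtype=np.float64).ravel()
    z = np.asarray(sfm_depth, dtype=np.float64).ravel()
    if mask is None:
        mask = np.isfinite(z) & (z > 0)  # pixels with a sparse SfM depth
    else:
        mask = np.asarray(mask).ravel()
    A = np.stack([d[mask], np.ones(mask.sum())], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, z[mask], rcond=None)
    return s * np.asarray(mono_depth, dtype=np.float64) + t, s, t
```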

Results and Analysis

LPIPS

Rig-based SfM methods significantly outperform the Stella-VSLAM baseline in perceptual similarity.
Both rig-based variants perform similarly, with the depth-prior pipeline reaching slightly lower LPIPS values, indicating improved perceptual fidelity.

PSNR

The depth-prior pipeline achieves the highest PSNR, with improved stability during the middle stages of GS training (≈10k iterations).
We observe a drop in PSNR after ~15k iterations for both rig-based SfM pipelines, likely due to overfitting to view-dependent effects such as specular highlights and shading changes. This behavior can be mitigated by adjusting learning rates or reducing training iterations.

SSIM

SSIM scores for both rig-based SfM pipelines are nearly identical and substantially higher than those of the Stella-VSLAM baseline, indicating improved structural consistency in the reconstructions.

Analysis

Rig-based SfM significantly improves Gaussian Splatting reconstruction quality over raw Stella-VSLAM poses, yielding large gains across PSNR, SSIM, and LPIPS. Incorporating depth priors further provides consistent, albeit modest, improvements, particularly in perceptual quality, without introducing metric regressions. These results highlight that pose accuracy is the dominant factor in GS performance, while depth priors act as a stabilizing signal that enhances perceptual fidelity under challenging capture conditions.