Background

Pointmap formulation for 3D reconstruction

The pointmap formulation recasts 3D reconstruction as a dense prediction task: each pixel in an image is mapped to a corresponding 3D point in space, yielding an (H, W, 3) tensor. Because every pixel carries an explicit 2D-3D correspondence, quantities such as depth maps, camera intrinsics, and camera poses can all be recovered from the same representation.
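Depth and intrinsics, for instance, fall out of a pointmap almost directly. The sketch below is a simplification, assuming the pointmap is expressed in the camera frame with the principal point at the image center (DUSt3R itself estimates focal length with a robust Weiszfeld-style iteration); it recovers a depth map and a least-squares focal length with NumPy:

import numpy as np

def depth_and_focal_from_pointmap(pointmap):
    """Recover a depth map and a pinhole focal length from an (H, W, 3)
    pointmap expressed in the camera frame. Assumes the principal point
    sits at the image center and pixels are square."""
    H, W, _ = pointmap.shape
    X, Y, Z = pointmap[..., 0], pointmap[..., 1], pointmap[..., 2]

    # In the camera frame, per-pixel depth is simply the z channel.
    depth = Z

    # Pixel coordinates relative to the assumed principal point.
    u, v = np.meshgrid(np.arange(W) - W / 2, np.arange(H) - H / 2)

    # Pinhole model: u = f * X/Z, v = f * Y/Z. Solve for the single
    # scalar f in closed form, in the least-squares sense.
    valid = Z > 1e-6
    x, y = X[valid] / Z[valid], Y[valid] / Z[valid]
    f = (u[valid] * x + v[valid] * y).sum() / (x * x + y * y).sum()
    return depth, f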

The DUSt3R [1] model (Dense and Unconstrained Stereo 3D Reconstruction) applies this formulation to pairs of images, predicting a pointmap for each image with both expressed in the coordinate frame of the first camera. This eliminates the need for prior camera calibration or known poses, simplifying the 3D reconstruction pipeline.
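Because image 2's pointmap already lives in camera 1's frame, the relative pose between the cameras can be recovered with a standard RANSAC-PnP solve between image-2 pixel coordinates and their predicted 3D points. The function below is our own hedged sketch of that step, not DUSt3R's API; the intrinsics argument K2 is an assumption:

import numpy as np
import cv2

def pose_of_camera2_in_camera1(pointmap_2_in_1, K2):
    """Estimate camera 2's pose in camera 1's frame from the (H, W, 3)
    pointmap of image 2 expressed in camera 1's coordinates. K2 is an
    assumed 3x3 float64 intrinsics matrix for camera 2."""
    H, W, _ = pointmap_2_in_1.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pixels = np.stack([u, v], axis=-1).reshape(-1, 2)
    points = pointmap_2_in_1.reshape(-1, 3).astype(np.float64)

    # RANSAC-PnP: find the world-to-camera-2 transform (R, t) that
    # projects the camera-1-frame points onto image-2 pixels.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(points, pixels, K2, None)
    R, _ = cv2.Rodrigues(rvec)
    # Invert to get camera 2's orientation and position in camera 1's frame.
    return R.T, (-R.T @ tvec).ravel()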

Although the DUSt3R architecture is tailored to 3D reconstruction from a single image pair, it can be scaled to multi-view settings by globally aligning the pairwise predictions with a non-linear optimization, as sketched below.
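Conceptually, the alignment jointly optimizes one similarity transform per image pair and one world-frame pointmap per view, so that every pair's prediction agrees with the shared geometry. The toy gradient-descent sketch below captures that idea under an assumed input layout; it omits DUSt3R's confidence weighting and the scale handling that rules out the trivial collapsed solution:

import torch

def rodrigues(w):
    """Axis-angle vector (3,) -> rotation matrix via the matrix
    exponential of its skew-symmetric form."""
    W = torch.zeros(3, 3)
    W[0, 1], W[0, 2], W[1, 2] = -w[2], w[1], -w[0]
    W[1, 0], W[2, 0], W[2, 1] = w[2], -w[1], w[0]
    return torch.matrix_exp(W)

def global_align(pairs, n_views, steps=300, lr=1e-2):
    """pairs: {(i, j): (Xi, Xj)} where Xi, Xj are (N, 3) pointmaps of
    views i and j, both expressed in view i's frame (assumed layout).
    Returns one (N, 3) pointmap per view in a shared world frame."""
    N = next(iter(pairs.values()))[0].shape[0]
    axis  = {e: torch.nn.Parameter(torch.zeros(3)) for e in pairs}
    trans = {e: torch.nn.Parameter(torch.zeros(3)) for e in pairs}
    logsc = {e: torch.nn.Parameter(torch.zeros(())) for e in pairs}
    world = {v: torch.nn.Parameter(torch.randn(N, 3)) for v in range(n_views)}
    params = (list(axis.values()) + list(trans.values())
              + list(logsc.values()) + list(world.values()))
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.zeros(())
        for (i, j), (Xi, Xj) in pairs.items():
            R, t, s = rodrigues(axis[i, j]), trans[i, j], logsc[i, j].exp()
            # Both pointmaps of a pair share one similarity transform into
            # the world frame; penalize disagreement with the per-view
            # world pointmaps. (The real method also keeps the scales
            # from collapsing toward zero.)
            loss = loss + (world[i] - (s * Xi @ R.T + t)).square().mean()
            loss = loss + (world[j] - (s * Xj @ R.T + t)).square().mean()
        loss.backward()
        opt.step()
    return {v: w.detach() for v, w in world.items()}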

VGGT

VGGT [2] improves upon pointmap-based methods like DUSt3R by eliminating the need for expensive global alignment. Instead, it employs a streamlined pipeline of 24 pairs of frame-wise and global self-attention layers, allowing it to jointly reason across all views. In a single forward pass, VGGT predicts dense depth maps, camera poses, pointmaps, and point tracks. This unified design, supervised with a multitask loss over all outputs, achieves state-of-the-art multi-view 3D reconstruction without any post-processing.
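In code, one frame-wise/global pair can be sketched as follows. This is an illustrative simplification with assumed (batch, views, tokens, channels) shapes, not VGGT's actual implementation: frame-wise attention folds the view axis into the batch so tokens only see their own view, while global attention flattens all views together:

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One frame-wise + global self-attention pair (illustrative only).
    Tokens are shaped (B, S, N, C): batch, S views, N tokens per view."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        B, S, N, C = x.shape
        # Frame-wise attention: fold views into the batch so each view's
        # N tokens attend only to one another.
        t = self.norm1(x).reshape(B * S, N, C)
        t, _ = self.frame_attn(t, t, t)
        x = x + t.reshape(B, S, N, C)
        # Global attention: flatten all S * N tokens of a scene so every
        # token can attend across views.
        t = self.norm2(x).reshape(B, S * N, C)
        t, _ = self.global_attn(t, t, t)
        return x + t.reshape(B, S, N, C)

Stacking 24 such blocks gives the interleaved per-view and cross-view reasoning described above; the depth, pose, pointmap, and track heads would then read off the resulting tokens.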

Comparing 3D reconstruction methods

To support our goal of leveraging 3D reconstruction for robotic manipulation, we compared several 3D reconstruction methods. Key evaluation metrics include reconstruction accuracy, inference time, temporal consistency across frames, and the ability to recover absolute depth; the sketch below makes that last criterion concrete.
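A method recovers absolute depth when its error stays low without any per-image scale alignment. The minimal check below is our own illustration, using the standard absolute relative error with optional median scaling:

import numpy as np

def abs_rel(pred_depth, gt_depth, align_scale=False):
    """Absolute relative depth error. With align_scale=True the prediction
    is median-scaled to the ground truth first, measuring only up-to-scale
    quality; with align_scale=False, a low error implies absolute depth."""
    mask = gt_depth > 0
    p, g = pred_depth[mask], gt_depth[mask]
    if align_scale:
        p = p * np.median(g) / np.median(p)
    return float(np.mean(np.abs(p - g) / g))

The results are presented in the table below.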

References

[1] Wang, Shuzhe, et al. “DUSt3R: Geometric 3D Vision Made Easy.”
[2] Wang, Jianyuan, et al. “VGGT: Visual Geometry Grounded Transformer.”