Background

Pointmap formulation for 3D reconstruction

The pointmap formulation recasts 3D reconstruction as a dense prediction task: each pixel in an image is mapped to a corresponding 3D point in space, yielding an (H, W, 3) tensor. This dense 2D-3D correspondence makes it straightforward to recover several geometric quantities, including depth maps, camera intrinsics, and relative camera poses, from a single representation.
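
Concretely, a pointmap expressed in the camera's own coordinate frame already contains the depth map (its z channel), and the intrinsics can be fitted from the dense 2D-3D correspondences. Below is a minimal NumPy sketch, assuming a pinhole camera with the principal point at the image centre; DUSt3R itself uses a more robust Weiszfeld-style estimator, so this is an illustration of the idea rather than the actual method.

```python
import numpy as np

def depth_from_pointmap(pointmap: np.ndarray) -> np.ndarray:
    """Depth is simply the z-coordinate when the pointmap is expressed
    in the camera's own frame (assumed convention)."""
    return pointmap[..., 2]

def estimate_focal(pointmap: np.ndarray) -> float:
    """Least-squares estimate of a single focal length for a pinhole camera
    with the principal point at the image centre, using the projection
    relations (u - cx) = f * X/Z and (v - cy) = f * Y/Z."""
    H, W, _ = pointmap.shape
    v, u = np.meshgrid(np.arange(H) - H / 2, np.arange(W) - W / 2, indexing="ij")
    X, Y, Z = pointmap[..., 0], pointmap[..., 1], pointmap[..., 2]
    valid = Z > 1e-6
    x, y = X[valid] / Z[valid], Y[valid] / Z[valid]
    # Closed-form minimizer of sum((u - f*x)^2 + (v - f*y)^2) over f.
    return float((u[valid] * x + v[valid] * y).sum() / (x * x + y * y).sum())
```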

DUSt3R [1] (Dense and Unconstrained Stereo 3D Reconstruction) builds on this formulation by predicting pointmaps for pairs of images, expressing both in the coordinate frame of the first camera. This removes the need for prior camera calibration or known poses and simplifies the 3D reconstruction pipeline.
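
Given such pairwise pointmaps, the relative pose between the two cameras can be recovered, for example, by a scaled rigid alignment between the second view's geometry expressed in its own frame and in the first view's frame (PnP-style recovery is another option). The Umeyama-style sketch below illustrates this; the function name and input convention are ours, not DUSt3R's.

```python
import numpy as np

def align_rigid(src: np.ndarray, dst: np.ndarray):
    """Scaled rigid alignment (Umeyama): find s, R, t minimizing
    ||s * R @ src_i + t - dst_i||^2 over corresponding 3D points.
    src, dst: (N, 3) arrays of matching points, e.g. the pointmap of
    view 2 in its own frame vs. in view 1's frame (hypothetical setup)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(dst_c.T @ src_c)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))   # guard against reflections
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (src_c ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```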

Although the DUSt3R architecture is designed for a single image pair, it can be scaled to multi-view settings by globally aligning the pairwise predictions with a non-linear optimization.
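
This global alignment can be written as a differentiable objective: each image pair receives a similarity transform (rotation, translation, scale) into a shared world frame, and gradient descent jointly optimizes those transforms together with one world-frame pointmap per view so that all pairwise predictions agree. The PyTorch sketch below shows a single term of such an objective; the naming and simplifications are ours, not DUSt3R's exact implementation.

```python
import torch

def alignment_loss(world_pts, pair_pts, pair_conf, pair_pose, pair_scale):
    """One term of a global-alignment objective (our sketch).

    world_pts:  (N, 3) world-frame pointmap being optimized for some view v
    pair_pts:   (N, 3) pointmap of view v predicted by one image pair e
    pair_conf:  (N,)   per-point confidence for that prediction
    pair_pose:  (4, 4) rigid transform taking pair e into the world frame
    pair_scale: ()     positive scale factor for pair e

    The full objective sums this term over every (pair, view) combination
    and runs gradient descent over world_pts, pair_pose (parametrized to
    stay rigid), and pair_scale for all pairs simultaneously."""
    R, t = pair_pose[:3, :3], pair_pose[:3, 3]
    transformed = pair_scale * (pair_pts @ R.T) + t
    return (pair_conf * (world_pts - transformed).norm(dim=-1)).sum()
```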

VGGT

VGGT [2] improves upon pointmap-based methods such as DUSt3R by eliminating the need for expensive global alignment. Instead, it uses a feed-forward transformer with 24 pairs of alternating frame-wise and global self-attention layers, allowing it to reason jointly across all input views. In a single forward pass, VGGT predicts dense depth maps, camera poses, pointmaps, and point tracks, with all outputs supervised jointly by a multi-task loss. This unified design achieves state-of-the-art multi-view 3D reconstruction without any post-processing.
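
The core of this design is the alternation between attention within each frame and attention across all frames. Below is a PyTorch sketch of one such pair of layers; the dimensions and the omission of the usual MLP sub-layers are our simplifications, not VGGT's exact architecture.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One frame-wise + global self-attention pair, in the spirit of
    VGGT's alternating attention. Tokens are shaped
    (batch, n_frames, tokens_per_frame, dim)."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, F, T, D = x.shape
        # Frame-wise attention: tokens only attend within their own frame.
        h = self.norm1(x).reshape(B * F, T, D)
        x = x + self.frame_attn(h, h, h, need_weights=False)[0].reshape(B, F, T, D)
        # Global attention: all tokens from all frames attend to each other.
        h = self.norm2(x).reshape(B, F * T, D)
        x = x + self.global_attn(h, h, h, need_weights=False)[0].reshape(B, F, T, D)
        return x
```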

Comparing 3D reconstruction methods

To support our goal of leveraging 3D reconstruction for robotic manipulation, we compared several 3D reconstruction methods. Key evaluation metrics include reconstruction accuracy, inference time, temporal consistency across frames, and the ability to recover absolute depth. The results are presented in the table below.


RaySt3R – single-view RGB-D to full 3D shape completion

RaySt3R [3] is a model for zero-shot 3D object completion: given a partial observation of an object, it predicts the object's complete geometry, including surfaces that were never observed.

  • Input: a single RGB-D image (a color image plus a depth map) of an object, together with a novel viewpoint encoded as a set of “query rays”.
  • Model: a feed-forward transformer predicts, for those query rays, a depth map, object masks, and per-pixel confidence scores.
  • Output: by fusing predictions across multiple query views, RaySt3R reconstructs the complete 3D shape of the object, filling in geometry that was occluded in the original view (see the sketch below).
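
To make the query-ray idea concrete, the NumPy sketch below shows one way rays for a novel viewpoint could be constructed and per-view predictions merged into a single point cloud. The interface is hypothetical (RaySt3R's actual ray encoding and fusion differ in detail), and for simplicity we treat the predicted depth as distance along each ray.

```python
import numpy as np

def query_rays(K: np.ndarray, cam_to_world: np.ndarray, H: int, W: int):
    """Build per-pixel query rays for a novel viewpoint (hypothetical encoding).
    Returns ray origins and unit directions in world coordinates, (H, W, 3) each."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # homogeneous pixel coords
    dirs_cam = pix @ np.linalg.inv(K).T                     # back-project to camera frame
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T          # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
    return origins, dirs_world

def fuse_views(predictions, conf_threshold: float = 0.5) -> np.ndarray:
    """Merge per-view predictions (depth along each ray, object mask, confidence)
    into one point cloud, keeping only confident, in-mask points."""
    points = []
    for origins, dirs, depth, mask, conf in predictions:
        keep = mask & (conf > conf_threshold)
        points.append(origins[keep] + depth[keep, None] * dirs[keep])
    return np.concatenate(points, axis=0)
```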

Architecture

References

[1] Wang, Shuzhe, et al. “DUSt3R: Geometric 3D Vision Made Easy.” CVPR 2024.
[2] Wang, Jianyuan, et al. “VGGT: Visual Geometry Grounded Transformer.” CVPR 2025.
[3] Duisterhof, Bardienus P., et al. “RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion.” NeurIPS 2025.