Introduction

Motivation

  • Manipulation tasks benefit from accurate depth and shape understanding.
  • New 3D vision foundation models enable fast, calibration-free, multi-view 3D reconstruction using standard RGB cameras.

Problem Statements

  • Accelerate global alignment to achieve robust, real-time multi-view 3D reconstruction for manipulation.
  • Leverage geometric foundation models such as VGGT to provide stronger geometric features that improve scene understanding and downstream manipulation tasks.
  • Use learning-based shape completion to turn partial RGB-D observations into complete point clouds, enabling more stable and reliable grasp planning.
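To make the global-alignment goal above concrete, here is a minimal illustrative sketch, not the method used by VGGT or any specific pipeline: given two views of the same points with known correspondences, a single view's rigid pose can be recovered in closed form with the Kabsch/Procrustes fit. Real multi-view global alignment optimizes all poses (and often scales) jointly and iteratively; this pairwise closed-form step is a simplified stand-in, and all names in the code are illustrative.

```python
import numpy as np

def kabsch(src, dst):
    """Closed-form least-squares rigid transform (R, t) with R @ src_i + t ~= dst_i."""
    src_c = src - src.mean(axis=0)          # center both point sets
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

# Toy example: a second "view" of the same scene, rotated and translated.
rng = np.random.default_rng(0)
world = rng.standard_normal((100, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -0.2, 1.0])
view2 = world @ R_true.T + t_true

R, t = kabsch(view2, world)        # align view 2 back into the world frame
aligned = view2 @ R.T + t
err = np.abs(aligned - world).max()
print(err < 1e-9)                  # residual is numerically zero
```

A full global-alignment stage would repeat such fits across all view pairs and refine the pose graph jointly, which is where the speed bottleneck targeted above arises.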