Method

Leveraging Geometric Foundation Models (VGGT) for Robotic Manipulation

In our experiments, we adopt VGGT as the representative 3D foundation model. For the manipulation policy, we use 3D Diffusion Policy [1], which operates on point cloud data and is trained via behavior cloning.
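As a quick illustration of what "trained via behavior cloning" means for a diffusion policy, the sketch below corrupts expert action chunks with noise and trains a network to predict that noise, conditioned on an observation embedding. The module names, tensor shapes, and the linear noise schedule are illustrative assumptions, not the DP3 reference implementation.

```python
# Minimal sketch of a diffusion-policy behavior-cloning objective
# (hypothetical shapes and module names; not the DP3 codebase).
import torch
import torch.nn as nn

class DenoisingPolicy(nn.Module):
    """Predicts the noise added to an action chunk, conditioned on an observation embedding."""
    def __init__(self, act_dim=7, horizon=8, obs_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim * horizon),
        )

    def forward(self, noisy_actions, obs_emb, t):
        # noisy_actions: (B, horizon, act_dim), obs_emb: (B, obs_dim), t: (B, 1) diffusion step
        x = torch.cat([noisy_actions.flatten(1), obs_emb, t], dim=-1)
        return self.net(x).view_as(noisy_actions)

def bc_diffusion_loss(policy, actions, obs_emb, num_steps=100):
    """Sample a diffusion step, corrupt the expert actions, and regress the added noise."""
    B = actions.shape[0]
    t = torch.randint(0, num_steps, (B, 1)).float() / num_steps
    noise = torch.randn_like(actions)
    # Simple linear noise schedule, purely for illustration.
    noisy = (1 - t).view(B, 1, 1) * actions + t.view(B, 1, 1) * noise
    pred = policy(noisy, obs_emb, t)
    return torch.nn.functional.mse_loss(pred, noise)

policy = DenoisingPolicy()
actions = torch.randn(16, 8, 7)   # expert action chunks from demonstrations
obs_emb = torch.randn(16, 64)     # e.g. the compact 3D observation embedding
loss = bc_diffusion_loss(policy, actions, obs_emb)
```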

We explore two ways of incorporating information from 3D foundation models:

Explicitly: Using the point clouds generated by 3D foundation models

  • 3D Diffusion Policy (DP3) uses ground-truth point clouds for simulated data and depth cameras to obtain point clouds for real-world data.
  • We modify this setup by integrating VGGT, which can utilize multi-view RGB images to generate a point cloud representation of the scene.
  • The resulting point cloud is subsequently fed into the DP3 point cloud encoder.
  • The encoder produces a compact 3D representation, which serves as input to the manipulation policy (a sketch of this pathway follows below).
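The sketch below illustrates this explicit pathway under simplifying assumptions: `run_vggt` is a stand-in for the actual VGGT inference call (whose interface is not reproduced here), and the encoder is a small MLP + max-pool module in the spirit of DP3's point cloud encoder rather than its exact architecture.

```python
# Sketch of the explicit pathway: multi-view RGB -> VGGT point cloud -> DP3-style encoder.
import torch
import torch.nn as nn

def run_vggt(images: torch.Tensor) -> torch.Tensor:
    """Placeholder: (V, 3, H, W) multi-view RGB -> (N, 3) world-frame points.
    In practice this would call the VGGT model and flatten its per-pixel point map."""
    V, _, H, W = images.shape
    return torch.randn(V * H * W, 3)  # dummy geometry for illustration only

def farthest_point_sample(points: torch.Tensor, k: int) -> torch.Tensor:
    """Naive FPS to produce the fixed-size point cloud the encoder expects (e.g. 1024 points)."""
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((points.shape[0],), float("inf"))
    for i in range(1, k):
        dist = torch.minimum(dist, (points - points[idx[i - 1]]).pow(2).sum(-1))
        idx[i] = dist.argmax()
    return points[idx]

class PointCloudEncoder(nn.Module):
    """Lightweight MLP + max-pool encoder in the spirit of DP3's point cloud encoder."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, pts):                   # pts: (N, 3)
        return self.mlp(pts).max(0).values    # (out_dim,) compact 3D representation

views = torch.rand(4, 3, 64, 64)                       # four RGB views of the scene
cloud = farthest_point_sample(run_vggt(views), 1024)   # VGGT-generated point cloud
obs_emb = PointCloudEncoder()(cloud)                   # fed to the manipulation policy
```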

Implicitly: Using features extracted from 3D foundation models

  • Instead of using a point cloud as input to the DP3 point cloud encoder to generate a compact 3D representation, we use features extracted from VGGT.
  • We experiment with various bottlenecking strategies to downsample the VGGT features into a compact 3D representation (one such strategy is sketched below).
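As one example of a bottlenecking strategy, the sketch below pools VGGT patch tokens into a compact vector with a single learned query via cross-attention. The token count and feature dimension are assumptions for illustration, not the actual VGGT output shapes.

```python
# Sketch of one bottlenecking strategy: learned-query attention pooling over VGGT features.
import torch
import torch.nn as nn

class FeatureBottleneck(nn.Module):
    """Cross-attention from one learned query over all patch tokens, followed by a projection."""
    def __init__(self, feat_dim=1024, out_dim=64, heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, out_dim)

    def forward(self, tokens):                     # tokens: (B, T, feat_dim) VGGT features
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (B, 1, feat_dim)
        return self.proj(pooled.squeeze(1))        # (B, out_dim) compact representation

vggt_tokens = torch.randn(2, 4 * 196, 1024)    # e.g. 4 views x 196 patch tokens per view (assumed)
obs_emb = FeatureBottleneck()(vggt_tokens)     # replaces the point-cloud-derived embedding
```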

Improving Grasping with Learning-based Shape Completion Networks

EconomicGrasp [2] is a 6-DOF grasp detection system that takes a 3D point cloud of a scene, typically obtained from a single RGB-D frame, and predicts feasible grasp poses in 3D space, including position, orientation, and grasp quality. The method replaces traditional dense grasp supervision with an economic supervision strategy that selects only a compact set of unambiguous grasp labels. A focal representation module and an interactive grasp head further refine these candidates, enabling the model to output accurate, high-quality 6-DOF grasps at significantly reduced training and memory cost compared to prior approaches.
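For concreteness, the snippet below shows the kind of interface assumed throughout this section: a scene point cloud goes in, and a set of scored 6-DOF grasps comes out, from which the best candidate is executed. The type and function names are hypothetical placeholders, not the EconomicGrasp codebase.

```python
# Illustration of the assumed 6-DOF grasp detection interface (names are hypothetical).
from dataclasses import dataclass
import numpy as np

@dataclass
class Grasp6DoF:
    translation: np.ndarray   # (3,) gripper position in the scene frame
    rotation: np.ndarray      # (3, 3) gripper orientation
    width: float              # gripper opening width in meters
    score: float              # predicted grasp quality

def select_best_grasp(grasps: list[Grasp6DoF]) -> Grasp6DoF:
    """Execute the highest-quality candidate returned by the detector."""
    return max(grasps, key=lambda g: g.score)

grasps = [
    Grasp6DoF(np.zeros(3), np.eye(3), width=0.05, score=0.9),
    Grasp6DoF(np.ones(3) * 0.1, np.eye(3), width=0.04, score=0.7),
]
best = select_best_grasp(grasps)
```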


Our Modification
In our work, we adapt the EconomicGrasp framework to operate on RaySt3r-completed point clouds rather than raw single-view RGB-D point clouds. RaySt3r performs zero-shot 3D shape completion, providing a more complete reconstruction of the underlying objects and allowing the grasp planner to reason over occluded regions and more accurate object geometry.

To integrate these richer inputs, we:
– Replace the standard input pipeline with RaySt3r-generated completed point clouds.
– Fine-tune EconomicGrasp on these completed reconstructions so the grasp prediction network learns to exploit RaySt3r’s improved geometric fidelity (a sketch of this fine-tuning setup follows below).
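The sketch below outlines the modified training loop under these assumptions: `complete_with_rayst3r` stands in for the RaySt3r completion step, and `grasp_net` / `grasp_loss` stand in for the EconomicGrasp network and its supervision, whose actual interfaces are not reproduced here.

```python
# Sketch of fine-tuning a grasp network on RaySt3r-completed point clouds
# (placeholder functions; not the actual RaySt3r or EconomicGrasp APIs).
import torch

def complete_with_rayst3r(partial_cloud: torch.Tensor) -> torch.Tensor:
    """Placeholder: (N, 3) partial cloud -> (M, 3) completed cloud including occluded surfaces."""
    return torch.cat([partial_cloud, partial_cloud + 0.01], dim=0)  # dummy completion for illustration

def finetune(grasp_net, dataloader, grasp_loss, epochs=5, lr=1e-4):
    """Fine-tune a pretrained grasp network on completed reconstructions."""
    opt = torch.optim.Adam(grasp_net.parameters(), lr=lr)
    for _ in range(epochs):
        for partial_cloud, grasp_labels in dataloader:
            completed = complete_with_rayst3r(partial_cloud)   # shape-completed geometry
            pred = grasp_net(completed)                        # 6-DOF grasp candidates
            loss = grasp_loss(pred, grasp_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```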

References

[1] Ze, Yanjie, et al. “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations.” arXiv preprint arXiv:2403.03954, 2024.
[2] Wu, Xiao-Ming, et al. “An Economic Framework for 6-DoF Grasp Detection.” ECCV, 2024.