Method

Leveraging Geometric Foundation Models (VGGT) for Robotic Manipulation

In our experiments, we adopt VGGT as the representative 3D foundation model. For the manipulation policy, we use 3D Diffusion Policy [1], which operates on point cloud data and is trained via behavior cloning.
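As a quick illustration of what "trained via behavior cloning" means for a diffusion policy, the sketch below corrupts expert action chunks with noise and trains a network to predict that noise, conditioned on an observation embedding. The module names, tensor shapes, and the linear noise schedule are illustrative assumptions, not the DP3 reference implementation.

```python
# Minimal sketch of a diffusion-policy behavior-cloning objective
# (hypothetical shapes and module names; not the DP3 codebase).
import torch
import torch.nn as nn

class DenoisingPolicy(nn.Module):
    """Predicts the noise added to an action chunk, conditioned on an observation embedding."""
    def __init__(self, act_dim=7, horizon=8, obs_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim * horizon),
        )

    def forward(self, noisy_actions, obs_emb, t):
        # noisy_actions: (B, horizon, act_dim), obs_emb: (B, obs_dim), t: (B, 1) diffusion step
        x = torch.cat([noisy_actions.flatten(1), obs_emb, t], dim=-1)
        return self.net(x).view_as(noisy_actions)

def bc_diffusion_loss(policy, actions, obs_emb, num_steps=100):
    """Sample a diffusion step, corrupt the expert actions, and regress the added noise."""
    B = actions.shape[0]
    t = torch.randint(0, num_steps, (B, 1)).float() / num_steps
    noise = torch.randn_like(actions)
    # Simple linear noise schedule, purely for illustration.
    noisy = (1 - t).view(B, 1, 1) * actions + t.view(B, 1, 1) * noise
    pred = policy(noisy, obs_emb, t)
    return torch.nn.functional.mse_loss(pred, noise)

policy = DenoisingPolicy()
actions = torch.randn(16, 8, 7)   # expert action chunks from demonstrations
obs_emb = torch.randn(16, 64)     # e.g. the compact 3D observation embedding
loss = bc_diffusion_loss(policy, actions, obs_emb)
```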

We explore two ways of incorporating information from 3D foundation models:

Explicitly: Using the point clouds generated by 3D foundation models

  • 3D Diffusion Policy (DP3) uses ground-truth point clouds for simulated data and depth cameras to obtain point clouds for real-world data.
  • We modify this setup by integrating VGGT, which can utilize multi-view RGB images to generate a point cloud representation of the scene.
  • The resulting point cloud is subsequently fed into the DP3 point cloud encoder.
  • The encoder produces a compact 3D representation, which serves as input to the manipulation policy (a sketch of this pathway follows below).
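The sketch below illustrates this explicit pathway under simplifying assumptions: `run_vggt` is a stand-in for the actual VGGT inference call (whose interface is not reproduced here), and the encoder is a small MLP + max-pool module in the spirit of DP3's point cloud encoder rather than its exact architecture.

```python
# Sketch of the explicit pathway: multi-view RGB -> VGGT point cloud -> DP3-style encoder.
import torch
import torch.nn as nn

def run_vggt(images: torch.Tensor) -> torch.Tensor:
    """Placeholder: (V, 3, H, W) multi-view RGB -> (N, 3) world-frame points.
    In practice this would call the VGGT model and flatten its per-pixel point map."""
    V, _, H, W = images.shape
    return torch.randn(V * H * W, 3)  # dummy geometry for illustration only

def farthest_point_sample(points: torch.Tensor, k: int) -> torch.Tensor:
    """Naive FPS to produce the fixed-size point cloud the encoder expects (e.g. 1024 points)."""
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((points.shape[0],), float("inf"))
    for i in range(1, k):
        dist = torch.minimum(dist, (points - points[idx[i - 1]]).pow(2).sum(-1))
        idx[i] = dist.argmax()
    return points[idx]

class PointCloudEncoder(nn.Module):
    """Lightweight MLP + max-pool encoder in the spirit of DP3's point cloud encoder."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, pts):                   # pts: (N, 3)
        return self.mlp(pts).max(0).values    # (out_dim,) compact 3D representation

views = torch.rand(4, 3, 64, 64)                       # four RGB views of the scene
cloud = farthest_point_sample(run_vggt(views), 1024)   # VGGT-generated point cloud
obs_emb = PointCloudEncoder()(cloud)                   # fed to the manipulation policy
```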

Implicitly: Using features extracted from 3D foundation models

  • Instead of using a point cloud as input to the DP3 point cloud encoder to generate a compact 3D representation, we use features extracted from VGGT.
  • We experiment with various bottlenecking strategies to downsample the VGGT features into a compact 3D representation (one such strategy is sketched below).
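As one example of a bottlenecking strategy, the sketch below pools VGGT patch tokens into a compact vector with a single learned query via cross-attention. The token count and feature dimension are assumptions for illustration, not the actual VGGT output shapes.

```python
# Sketch of one bottlenecking strategy: learned-query attention pooling over VGGT features.
import torch
import torch.nn as nn

class FeatureBottleneck(nn.Module):
    """Cross-attention from one learned query over all patch tokens, followed by a projection."""
    def __init__(self, feat_dim=1024, out_dim=64, heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, out_dim)

    def forward(self, tokens):                     # tokens: (B, T, feat_dim) VGGT features
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (B, 1, feat_dim)
        return self.proj(pooled.squeeze(1))        # (B, out_dim) compact representation

vggt_tokens = torch.randn(2, 4 * 196, 1024)    # e.g. 4 views x 196 patch tokens per view (assumed)
obs_emb = FeatureBottleneck()(vggt_tokens)     # replaces the point-cloud-derived embedding
```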

Improving Grasping with Learning-based Shape Completion Networks

EconomicGrasp [2] is a 6-DOF grasp detection system that takes a 3D point cloud of a scene, typically obtained from a single RGB-D frame, and predicts feasible grasp poses in 3D space, including position, orientation, and grasp quality. The method replaces traditional dense grasp supervision with an economic supervision strategy that selects only a compact set of unambiguous grasp labels. A focal representation module and an interactive grasp head further refine these candidates, enabling the model to output accurate, high-quality 6-DOF grasps at significantly reduced training and memory cost compared to prior approaches.
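For concreteness, the snippet below shows the kind of interface assumed throughout this section: a scene point cloud goes in, and a set of scored 6-DOF grasps comes out, from which the best candidate is executed. The type and function names are hypothetical placeholders, not the EconomicGrasp codebase.

```python
# Illustration of the assumed 6-DOF grasp detection interface (names are hypothetical).
from dataclasses import dataclass
import numpy as np

@dataclass
class Grasp6DoF:
    translation: np.ndarray   # (3,) gripper position in the scene frame
    rotation: np.ndarray      # (3, 3) gripper orientation
    width: float              # gripper opening width in meters
    score: float              # predicted grasp quality

def select_best_grasp(grasps: list[Grasp6DoF]) -> Grasp6DoF:
    """Execute the highest-quality candidate returned by the detector."""
    return max(grasps, key=lambda g: g.score)

grasps = [
    Grasp6DoF(np.zeros(3), np.eye(3), width=0.05, score=0.9),
    Grasp6DoF(np.ones(3) * 0.1, np.eye(3), width=0.04, score=0.7),
]
best = select_best_grasp(grasps)
```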


Our Modification
In our work, we adapt the EconomicGrasp framework to operate on RaySt3r-completed point clouds rather than raw single-view RGB-D point clouds. RaySt3r performs zero-shot 3D shape completion, providing a more complete reconstruction of the underlying objects and allowing the grasp planner to reason over occluded regions and more accurate object geometry.

To integrate these richer inputs, we:
– Replace the standard input pipeline with RaySt3r-generated completed point clouds.
– Fine-tune EconomicGrasp on these completed reconstructions so the grasp prediction network learns to exploit RaySt3r’s improved geometric fidelity (a sketch of this fine-tuning setup follows below).
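The sketch below outlines the modified training loop under these assumptions: `complete_with_rayst3r` stands in for the RaySt3r completion step, and `grasp_net` / `grasp_loss` stand in for the EconomicGrasp network and its supervision, whose actual interfaces are not reproduced here.

```python
# Sketch of fine-tuning a grasp network on RaySt3r-completed point clouds
# (placeholder functions; not the actual RaySt3r or EconomicGrasp APIs).
import torch

def complete_with_rayst3r(partial_cloud: torch.Tensor) -> torch.Tensor:
    """Placeholder: (N, 3) partial cloud -> (M, 3) completed cloud including occluded surfaces."""
    return torch.cat([partial_cloud, partial_cloud + 0.01], dim=0)  # dummy completion for illustration

def finetune(grasp_net, dataloader, grasp_loss, epochs=5, lr=1e-4):
    """Fine-tune a pretrained grasp network on completed reconstructions."""
    opt = torch.optim.Adam(grasp_net.parameters(), lr=lr)
    for _ in range(epochs):
        for partial_cloud, grasp_labels in dataloader:
            completed = complete_with_rayst3r(partial_cloud)   # shape-completed geometry
            pred = grasp_net(completed)                        # 6-DOF grasp candidates
            loss = grasp_loss(pred, grasp_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```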

References

[1] Ze, Yanjie, et al. “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations.” arXiv preprint arXiv:2403.03954, 2024.
[2] Wu, Xiao-Ming, et al. “An Economic Framework for 6-DoF Grasp Detection.” ECCV, 2024.