Method

In our experiments, we adopt VGGT as the representative 3D foundation model. For the manipulation policy, we use 3D Diffusion Policy [1], which operates on point cloud data and is trained via behavior cloning.

We explore two ways of incorporating information from 3D foundation models:

Explicitly: Using the point clouds generated by 3D foundation models

  • 3D Diffusion Policy (DP3) uses ground-truth point clouds for simulated data and depth cameras to obtain point clouds for real-world data.
  • We modify this setup by integrating VGGT, which can utilize multi-view RGB images to generate a point cloud representation of the scene.
  • The resulting point cloud is subsequently fed into the DP3 point cloud encoder.
  • The encoder produces a compact 3D representation, which serves as input to the manipulation policy (see the sketch below).
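
As a rough illustration of the explicit pipeline, the sketch below passes multi-view RGB images through VGGT to obtain a scene point cloud, then encodes it with a DP3-style point cloud encoder. The `vggt_point_cloud` wrapper, the encoder architecture, and all tensor shapes are illustrative assumptions, not the exact interfaces of the VGGT or DP3 codebases.

```python
# Minimal sketch of the explicit pipeline (hypothetical interfaces,
# not the actual VGGT or DP3 APIs).
import torch
import torch.nn as nn


def vggt_point_cloud(rgb_views: torch.Tensor) -> torch.Tensor:
    """Hypothetical wrapper: run VGGT on multi-view RGB images and return a
    fused scene point cloud of shape (N, 3). A real implementation would call
    the released VGGT model and merge its per-view point predictions."""
    raise NotImplementedError


class PointCloudEncoder(nn.Module):
    """DP3-style lightweight encoder: a shared per-point MLP followed by
    max pooling into a single compact 3D representation."""

    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) -> per-point features (N, out_dim) -> max pool -> (out_dim,)
        return self.mlp(points).max(dim=0).values


# Usage (shapes illustrative): V views of HxW RGB -> point cloud -> compact feature.
# rgb_views = torch.rand(V, 3, H, W)
# points = vggt_point_cloud(rgb_views)        # (N, 3)
# scene_feat = PointCloudEncoder()(points)    # (64,), fed to the diffusion policy
```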

Implicitly: Using extracted features from 3D foundation models

  • Instead of using a point cloud as input to the DP3 point cloud encoder to generate a compact 3D representation, we use features extracted from VGGT.
  • We experiment with various bottlenecking strategies to downsample the VGGT features into a compact 3D representation (one such strategy is sketched below).
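
One simple bottlenecking strategy, sketched below under the assumption that VGGT exposes per-view patch features of shape (V, P, D), is to pool the tokens and project them to the same compact dimensionality the policy expects. The feature shapes and the projection head are assumptions for illustration, not the project's exact design.

```python
# Minimal sketch of one feature-bottlenecking strategy
# (shapes and the projection head are assumptions).
import torch
import torch.nn as nn


class FeatureBottleneck(nn.Module):
    """Pool per-view patch features from a 3D foundation model into a single
    compact vector that replaces the DP3 point cloud embedding."""

    def __init__(self, feat_dim: int = 1024, out_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, out_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (V, P, D) patch tokens from V views.
        pooled = features.mean(dim=(0, 1))  # average over views and patches -> (D,)
        return self.proj(pooled)            # compact 3D representation -> (out_dim,)


# Usage with illustrative shapes: 4 views, 196 patches, 1024-dim features.
# feats = torch.rand(4, 196, 1024)
# compact = FeatureBottleneck()(feats)  # (64,), consumed by the manipulation policy
```

Mean pooling is only the simplest option; attention-based pooling or a learned query token would be natural alternatives within the same interface.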

References

[1] Ze, Yanjie, et al. "3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations." arXiv preprint arXiv:2403.03954 (2024).