In our experiments, we adopt VGGT as the representative 3D foundation model. For the manipulation policy, we use 3D Diffusion Policy [1], which operates on point cloud data and is trained via behavior cloning.
We explore two ways of incorporating information from 3D foundation models:
Explicitly: Using the point clouds generated by 3D foundation models

- 3D Diffusion Policy (DP3) uses ground-truth point clouds for simulated data and point clouds obtained from depth cameras for real-world data.
- We modify this setup by integrating VGGT, which generates a point cloud representation of the scene from multi-view RGB images.
- The resulting point cloud is subsequently fed into the DP3 point cloud encoder.
- The encoder produces a compact 3D representation, which serves as input to the manipulation policy.
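The explicit pipeline above can be sketched as a small preprocessing step between the foundation model and the policy encoder. This is a minimal illustration, not the actual implementation: it assumes the foundation model yields per-view point maps in a shared frame together with a per-pixel confidence, and that the encoder expects a fixed number of points. The function names, the confidence threshold, and the use of farthest point sampling for downsampling are all assumptions made for this sketch.

```python
import numpy as np


def farthest_point_sampling(points, num_samples, seed=0):
    """Greedy farthest point sampling: returns (num_samples, 3) well-spread points."""
    n = points.shape[0]
    rng = np.random.default_rng(seed)
    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = rng.integers(n)          # random starting point
    min_dist = np.full(n, np.inf)          # squared distance to the nearest selected point
    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = np.argmax(min_dist)  # farthest from everything selected so far
    return points[selected]


def point_maps_to_cloud(point_maps, confidence, conf_thresh=0.5, num_points=1024):
    """Merge per-view point maps (V, H, W, 3) into one cloud of shape (num_points, 3).

    `conf_thresh` and `num_points` are hypothetical values chosen for illustration.
    """
    pts = point_maps.reshape(-1, 3)
    keep = confidence.reshape(-1) >= conf_thresh   # drop low-confidence pixels
    pts = pts[keep]
    if pts.shape[0] == 0:
        raise ValueError("no points passed the confidence threshold")
    if pts.shape[0] < num_points:
        # Too few points: pad by resampling with replacement.
        idx = np.random.default_rng(0).integers(pts.shape[0], size=num_points)
        return pts[idx]
    return farthest_point_sampling(pts, num_points)


# Usage with random stand-in data (2 views, 16x16 pixels each).
rng = np.random.default_rng(0)
demo_maps = rng.standard_normal((2, 16, 16, 3))
demo_conf = rng.uniform(size=(2, 16, 16))
demo_cloud = point_maps_to_cloud(demo_maps, demo_conf, conf_thresh=0.5, num_points=64)
```

The resulting fixed-size array is what would be passed to a point cloud encoder in place of a depth-camera or ground-truth cloud.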
Implicitly: Using extracted features from 3D foundation models

- Instead of feeding a point cloud to the DP3 point cloud encoder to obtain a compact 3D representation, we use features extracted from VGGT.
- We experiment with various bottlenecking strategies to downsample the VGGT features into a compact 3D representation.
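To make the idea of a bottleneck concrete, the sketch below shows two generic ways to compress a set of feature tokens into a single compact vector: mean pooling followed by a linear projection, and single-query attention pooling. These are illustrative options only and are not necessarily the strategies evaluated here; the token count, feature width, output size, and all function names are hypothetical, and the weights are random stand-ins for parameters that would be learned.

```python
import numpy as np


def mean_pool_bottleneck(tokens, proj):
    """(T, C) tokens -> (D,) vector: average over tokens, then project linearly."""
    return tokens.mean(axis=0) @ proj


def attention_pool_bottleneck(tokens, query, proj):
    """(T, C) tokens -> (D,) vector: one query attends over all tokens, then project."""
    scores = tokens @ query / np.sqrt(tokens.shape[1])  # scaled dot-product scores, (T,)
    scores = scores - scores.max()                      # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()                   # attention weights sum to 1
    return (weights @ tokens) @ proj


# Stand-in data: 196 tokens of width 1024 compressed to a 64-dimensional vector.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 1024))
proj = rng.standard_normal((1024, 64)) * 0.02
query = rng.standard_normal(1024)

z_mean = mean_pool_bottleneck(tokens, proj)
z_attn = attention_pool_bottleneck(tokens, query, proj)
```

Mean pooling treats every token equally, while attention pooling lets a learned query emphasize the tokens most relevant to the task; with an all-zero query the two coincide.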
References
[1] Ze, Yanjie, et al. “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations.” arXiv preprint arXiv:2403.03954 (2024).