We adopt the environmental setup and evaluation protocol used in Diffusion Policy (DP3). We utilize the Adroit simulation environment, built on MuJoCo, to evaluate the proposed method. Adroit provides a set of dexterous manipulation tasks involving a 24-DoF anthropomorphic robotic hand, making it an good benchmark for fine-grained visuomotor control and imitation learning. Specifically, we focus on the Adroit Hammer task, in which the hand must grasp a hammer and drive a nail into a wooden block
In the experiments below, we compare different bottle-necking and polling strategies to downsample features to a compact 3D representation. We also experiment on which layers of the VGGT model would be most suitable for a manipulation task. Further to demonstrate the advantage of VGGT 3D features by comparing against DINOv2 features.
MLP Bottleneck
We compare the mean success rates when using VGGT features from layer 6, layer 12 and layer 18 when using an MLP bottleneck for downsampling. The best performance was achieved with features from layer 6, indicating that earlier layers yield better results.

Conv Bottleneck
We compare the mean success rates when using VGGT features from layer 6, layer 12 and layer 18 when using an Conv bottleneck for downsampling. Similar to the MLP bottleneck above, best performance was achieved with features from layer 6, indicating that earlier layers yield better results.
When comparing MLP and Conv bottlenecks, the MLP strategy consistently yields better results.

Polling Strategies
To reduce the dimensionality of VGGT features, we also explored various pooling strategies. Our results showed that using a linear layer without any pooling achieved the best performance, suggesting that the information loss introduced by pooling negatively impacts results.

3D features (VGGT) vs 2D features (DINOv2)
To assess the benefits of a 3D-aware model like VGGT, we compare it against strong image feature representations from DINOv2. VGGT shows a slight performance gain, suggesting that 3D-aware features can offer valuable advantages for manipulation tasks.

Preliminary Quantitative Results: Adroit Hammer Task
The table below presents the mean success rate on the Adroit Hammer task. Methods using the PC input representation rely on ground-truth point clouds. Despite operating with less information (i.e., RGB input), our method—leveraging VGGT features—achieves performance nearly on par with the best-performing model, the vanilla DP3 encoder that utilizes ground-truth point clouds.
