Results & Conclusion

Bridging the Sim2Real Gap with Generative Models

Given that our point-map regressor uses a frozen SD-DINO encoder as its backbone, one plausible explanation for its ability to generalize to real humans is that the encoder alone provides strong domain robustness. To assess the importance of the proposed synthetic data generation pipeline, we compare two point-map regressors evaluated on RGB images of humans from the MoVi dataset: the first is trained solely on robot-rendered images, while the second is trained on the synthetic human images produced by our data generation pipeline.

As illustrated in the figure above, our results reveal a clear contrast: the regressor trained on robot imagery fails to generalize to real humans, whereas the regressor trained on synthetic human images achieves substantially better performance. This indicates that the synthetic data generation pipeline is essential for bridging the domain gap and enabling effective transfer to real human imagery.
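
For readers curious about the setup, the snippet below is a minimal PyTorch sketch of a point-map regressor built on a frozen feature backbone. It is illustrative only: the `backbone` argument stands in for the frozen SD-DINO encoder, and the feature dimension, head layers, and number of correspondence layers are assumed placeholder values rather than our exact configuration.

```python
import torch
import torch.nn as nn


class PointMapRegressor(nn.Module):
    """Per-pixel 3D point regression on top of a frozen feature backbone.

    Sketch only: `backbone` stands in for the frozen SD-DINO encoder and is
    assumed to return a (B, C, h, w) feature map; the head sizes are
    placeholder values, not our exact configuration.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int = 768, num_layers: int = 2):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the encoder frozen
        # Lightweight head: features -> (num_layers * 3) channels, i.e. one
        # (x, y, z) prediction per pixel for each correspondence layer.
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_layers * 3, kernel_size=1),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W)
        with torch.no_grad():
            feats = self.backbone(images)  # (B, feat_dim, h, w)
        points = self.head(feats)          # (B, num_layers * 3, h, w)
        # Upsample the coarse prediction back to the input resolution.
        return nn.functional.interpolate(
            points, size=images.shape[-2:], mode="bilinear", align_corners=False
        )
```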

Generalization to Diverse Body Proportions

We assess our model’s ability to handle extreme body proportions by evaluating it on an unusually tall basketball player whose limb lengths and overall stature lie far outside the distribution of typical training subjects, as demonstrated in the figure above. Despite this substantial deviation from canonical human shape, the model continues to produce stable, multi-layer 3D point correspondences for every pixel. Notably, these predicted correspondences remain metrically consistent and align well to a fixed-size humanoid robot without requiring explicit re-scaling or additional optimization. This indicates that our feedforward pipeline effectively bridges the embodiment gap even when confronted with highly stretched limb ratios. While this case study examines a single extreme example, it highlights the model’s ability to extract shape-invariant geometric cues, suggesting promising generalization to a broader range of atypical body types.

In-the-Wild Example

We also evaluated our pipeline on an in-the-wild video captured using an iPhone. In this example, a volunteer classmate performs a short dance sequence, and the resulting humanoid motion closely follows the subject’s movements. This result demonstrates the robustness of our approach under real-world conditions.

Baseline Comparison

This video demonstrates our results on a real human sequence, with comparisons against the ground truth and two baseline methods. The first baseline predicts joint angles directly from 2D images, while the second applies HMR 2.0 followed by motion retargeting. These comparisons highlight the advantages of our approach in capturing accurate and stable humanoid motion.

Finally, we present quantitative comparisons with two baseline methods. Our approach achieves lower joint-position and joint-angle errors than the 2D image–only baseline, highlighting the importance of explicit 3D reasoning. Compared to HMR 2.0 with retargeting, our method delivers comparable performance while requiring significantly less training data, demonstrating both efficiency and effectiveness.
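
For reference, the two error metrics can be computed along the lines of the sketch below; the array shapes and the degree-based angle convention are assumptions about the evaluation setup rather than a description of our exact code.

```python
import numpy as np


def mean_joint_position_error(pred_xyz, gt_xyz):
    """Mean per-joint position error (same units as the inputs).

    pred_xyz, gt_xyz: (T, J, 3) arrays of joint positions over T frames.
    """
    return np.linalg.norm(pred_xyz - gt_xyz, axis=-1).mean()


def mean_joint_angle_error(pred_deg, gt_deg):
    """Mean absolute joint-angle error in degrees, wrapped to [-180, 180).

    pred_deg, gt_deg: (T, J) arrays of joint angles over T frames.
    """
    diff = (pred_deg - gt_deg + 180.0) % 360.0 - 180.0
    return np.abs(diff).mean()
```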

2D ≠ 3D Understanding

The naive baseline, a purely image-based model, struggled with joint-angle prediction: lacking 3D context and any notion of kinematic structure, it collapsed to predicting the mean pose.

Bridges Sim-to-Real

Leveraging generative models in synthetic data generation helped bridge the sim-to-real gap.
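
As a rough sketch of this idea (not our exact pipeline), a conditioning-based image generator can translate renders of the posed robot into photorealistic human images while preserving the underlying geometry. The specific diffusers checkpoints, the depth conditioning, and the prompt below are illustrative assumptions.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Illustrative sketch only: the checkpoints, depth conditioning, and prompt
# below are assumptions, not our exact data-generation configuration.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A depth map rendered from the posed robot fixes the geometry, while the
# diffusion model fills in photorealistic human appearance.
depth_map = Image.open("robot_pose_depth.png")  # hypothetical render from the simulator
human_image = pipe(
    "a photo of a person dancing, realistic lighting",
    image=depth_map,
    num_inference_steps=25,
).images[0]
human_image.save("synthetic_human.png")
```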

Adapts Across Morphologies

Even though the training images were generated from a fixed-size robot, the method transferred effectively to humans with varying body shapes.